PREFACE TO THE GERMAN EDITION


Les mathématiciens ont autant besoin
d’être philosophes que les philosophes
d’être mathématiciens.

G. W. Leibniz 

(Lettre à Malebranche, mars, 1699)

The philosophical problems of the concept of probability have time
and again occupied the minds of philosophers and mathematicians. 
Recently they have been brought to the fore with even greater 
emphasis: first, because of the prominence of the concept of probability in 
modern physics, where it has gradually replaced the concept of causality; 
second, because of the development of the modern philosophy of nature, 
which has analyzed the concept of probability for its own sake. With the 
incorporation of the results of symbolic logic, the new philosophy of nature 
has developed, in the meantime, from a critical investigation of the thinking 
of mathematical physics to a scientific theory of knowledge. It has now 
reached the stage at which it begins to replace the era of metaphysical
constructions in philosophy by the establishment of a philosophical science.
Abundant results have already been reached in the investigation of the
space-time problem, in the logical criticism of mathematics, in the analysis of the
problem of causality, and in the general criticism of scientific concepts. The 
problem of probability, however, has resisted with peculiar persistence all 
attempts at solution. Yet it has become obvious, as never before, that this 
problem, because of its intimate relation to the problem of induction, contains 
the nucleus of every theory of knowledge. 

The philosophical theory of the probability problem, more than any other, 
is based upon mathematical analysis. Previous attempts to solve the problem 
were bound to break down because the mathematical basis of the calculus of 
probability had not yet been developed in the rigorous form on which
philosophical criticism depends. Philosophical analysis was the starting point for
a new mathematical construction of the calculus of probability. Further 
inquiry showed that the new mathematical construction made possible the 
long-sought transition from two-valued logic to probability logic, that is, 
to a logic with a continuous scale of truth values. Thus a final theory of 
probability that satisfies both mathematical and logical requirements can 
now be presented. 

In a period in which the value of speculative philosophy is seriously
questioned, it will not be regarded as surprising that a philosophical theory expects
new insights from the construction and elucidation of mathematical theories. 
I hope to demonstrate that, conversely, the mathematical theory can be
furthered by philosophical points of view. The unification of the calculus of 
probability with the calculus of logic that determines the form of this
exposition and, furthermore, the inclusion of all problems of the theory of
probability in a general calculus of probability, as well as the theory of the order
of probability sequences developed in this connection, seem to be results that 
may be of value to the mathematician also. He may be interested likewise 
in the continuous generalization of the concept of probability sequence, 
leading from the familiar geometrical probabilities to continuous probability 
sequences. The complete elaboration of the mathematical structure, which 
is necessary for the philosophical aim, makes this book at the same time a 
mathematical textbook on the calculus of probability. I hope that
clarification of fundamental concepts will make the presentation particularly
adequate for this purpose. Nothing makes the penetration into a new
mathematical field more difficult than the improper treatment of fundamental
concepts. But the so-called mathematical difficulties vanish when the
meanings of the concepts whose relations make up the mathematical calculus are
clarified. 

I was fortunate to have had the opportunity, since 1927, of repeatedly 
presenting the ideas of this book to my students at the University of Berlin. 
Many details and examples were developed in class discussion, and the 
congenial atmosphere greatly helped me in working out solutions. 

Only a few of the many names that I should like to mention may be
included. I owe valuable help to C. Hempel, A. Becker, M. Strauss, O. Helmer,
and my old friends K. Grelling and W. Dubislav. V. Bargmann assisted in 
the mathematical part and elaborated numerous details and proofs. I thank 
my friend R. Carnap of Prague for reading the introduction to symbolic 
logic. E. Kokott assisted in drawing the figures. C. Hempel and V. Bargmann 
also aided in the correction of proofs. 

Hans Reichenbach 

Istanbul, 

August, 1934 



PREFACE TO THE ENGLISH EDITION 


It is now more than fifteen years since this book was written in the
German version. During this time the theory of probability presented 
in it has been much discussed, and the attacks made upon it have given 
me the opportunity to think over its content again and again. I am grateful 
to my opponents because their criticism has compelled me to check every 
item of my theory. Many details have thus been improved in this
presentation. In particular, the exposition of the problems of application and induction
has been rearranged and supplemented by many additions. The English 
version is therefore designated as the second edition of the book. 

But, on the whole, I found that I had nothing to change in my views. The 
theory has stood up to the test of a critical discussion in all its fundamentals.
The generalization of the customary theory, which identifies probability
sequences with random sequences, into a comprehensive system that embraces
sequences of all types of order, including continuous probability sequences,
has turned out to be an adequate integration of the pursuit of the
mathematician. The theory of the application of probability concepts to physical
reality and the analysis of the problem of induction have shown that an
empiricist solution of these problems can be given. It seems to me that the various
attacks launched against this part of my theory have helped very much to 
make its significance clear. 

The attacks came mainly from two sides, which I should like to distinguish 
as the rationalists and the unshakable Humeans. Both agree so far as they 
regard my justification of induction as unsatisfactory, but they do so for 
different reasons. The rationalists find this justification too weak because 
they wish to find a stronger one, which in some way or other bases induction 
and probability on a rational belief. The unshakable Humeans also maintain 
that I did not justify induction; but they do so because they believe, like 
Hume, that any such justification is impossible and that Hume’s position 
represents the final stage in the analysis of the problem. The more I read of 
these arguments, the more convinced I became that my conception is correct; 
and I should like to explain in what points I see the merits of this conception. 

The first achievement, I think, consists in the demonstration that the 
frequency interpretation of probability can be carried through for all uses 
of the term “probable”. The difficulties that others had found in interpreting
the probabilities of single events as limits of frequencies disappear when 
statements about single cases are regarded, not as assertions, but as posits. 
When, in the face of these results, there remain philosophers who maintain 
that there is a second concept of probability, which is not reducible to frequency
notions, I must answer that there is no need for such a concept.
Whoever wishes to reserve the right of using private meanings of a
non-verifiable pattern, or of a structure useless for predictions, may do so. What
has been shown, however, is that the frequency interpretation of probability 
leads to a meaning of the term that makes the usage of language conform to 
human behavior. I do not think there are other grounds that a philosophical 
theory could adduce for its claim to provide an adequate interpretation of 
terms. 

The second achievement I find in the fact that, when the frequency
interpretation is used, all nondeductive methods of the calculus of probability
reduce to one kind of inference: the inference of induction by enumeration. 
Since all inductive methods of science, including the theory of indirect
evidence and the formation of scientific theories, are interpretable in terms of
inferences supplied by the calculus of probability, this result establishes the 
thesis that all forms of inductive inference are reducible to one form, to the 
inference of induction by enumeration. This thesis was implicitly contained 
in some older forms of a theory of induction, in particular, in Hume's theory. 
But so far it had not been given a proof. 

As the third achievement, I should like to mention the fact that the theory 
of probability has been freed from all forms of a rational belief in synthetic 
statements, whether they appear in the form of a synthetic a priori, or an
animal faith, or a belief in a uniformity of the world, or a principle of
insufficient reason. All such notions are remnants of a philosophy of rationalism,
which holds that human reason has access to knowledge of the physical world 
by other means than sense observation. What has been shown is that a theory 
of probability can be built up without the use of such notions. There is no 
such thing as inductive self-evidence. It is unfortunate that this discovery 
of David Hume has so often been forgotten, and that so many attempts 
have been made to reintroduce inductive self-evidence in new forms whenever 
some older forms had been proved to be untenable. 

The fourth achievement of the theory I see in its successful justification 
of the inference of induction by enumeration, and therefore of all forms of 
inductive inference, in spite of its renunciation of arguments based on
synthetic self-evidence. The inductive inference is regarded as permissible, not
because it appears self-evident, but because it represents an instrument of 
prediction so devised that it must lead to success if success is attainable. 

This conception of a justification has been attacked by various means. 
Although it is now generally admitted that it would be asking too much to 
require a proof that the inductive inference must lead to true conclusions, 
other requirements for a justification were advanced with the intention of 
ruling out my form of a justification. For instance, a justification of a method 
has been defined as a proof that there exists some inductive evidence, or 
probability, that the method will reach its aim. On the basis of such a
definition it is easy to prove that what I have given is not a justification of induction.

It should be clear, however, that the argument I have set forth for the use 
of induction cannot be deprived of its rigor by narrowing down the meaning 
of words. That it is advisable to use a certain means if it can be shown to be 
the best means for a desired end, and if nothing is known about the
attainability of this end, appears to be an argument exempt from doubt; and it
seems to me that the word “justification” has always been applied to such 
situations. When Magellan sailed along the American shore with the intention 
of finding a western passage, he had no inductive evidence that there was 
one; but his enterprise was justified because it was a means to reach his aim 
if the aim was attainable. It is in this sense of the word that a justification 
of induction has been given. 

What makes this justification of induction appear weak is the adherence 
to rationalistic standards, the creed that we can do more for the finding of 
predictions than prepare everything for the case where success is attainable. 
I do not think that any such creed is compatible with the principles of
empiricism, nor that it is an indispensable backbone for action in this
recalcitrant world. The search for certainty is a desire deeply rooted in human
nature; yet we must not infer that we should submit to it. I prefer a philosophy 
that teaches us to walk without the crutches of any kind of faith. It is of 
the essence of empiricism that it refuses to recognize any form of rationalized 
belief. That such refusal need not lead to skepticism, that actions anticipating 
the future can be justified without any reference to belief, has been shown 
through my analysis of the problem of induction. I do not think that a
justification of the inductive method of prediction may be called weak if it proves
that employing this method is the best we can do for the attainment of the 
aim. This proof is the result to which I came through the construction of my 
theory of probability and induction some fifteen years ago. It is with the 
intention of submitting this result to the judgment of the English-speaking 
world that I present this English edition of my book. 

Hans Reichenbach 

University of California, 

Los Angeles, 

May, 1948 



CHANGES AND ADDITIONS IN THE 
ENGLISH EDITION 


Throughout the book, numerous minor changes were made; but only the major changes 
will be mentioned. In chapter 1 the last paragraphs of § 2 were cut off and added to § 3, 
which was shortened in other parts. The introduction to symbolic logic, given in chapter 2, 
was abbreviated and adjusted to the results of a detailed exposition of symbolic logic given 
in another publication (see footnote, p. 12). In chapter 3 and the following chapters the 
symbolic notation was adapted to the English language; in particular, the German symbol 
W was replaced by the symbol P, standing for “probability”. Some changes and additions
were made in §§ 12 and 22. A new section was inserted as § 24; furthermore, the appendix
to chapter 3, which contains exercises, was added. From §§ 1 to 23 the numbers of the
English and the German sections coincide; from §§ 25 to 46 the numbers of the English sections
exceed by one those of the German sections. In chapter 4, § 30 was supplemented by
references to recent publications on the problem of randomness. Chapters 5 and 6 remained
unchanged, except for some additions in § 45 and the addition of § 47. From §§ 48 to 62
the numbers of the English sections exceed by two those of the German sections. In 
chapters 7 and 8 changes were made in §§ 49, 57, and 62. 

Chapters 9-11 take the place of the German chapters 9-10. These chapters were greatly 
changed in the order of the material presented, so that no simple rule for the comparison 
of sections can be given; and many changes and additions were made in the interior of the 
sections. The following sections of these chapters are new: §§ 70-74, 81-88. The appendix 
of the German edition, which contains some mathematical extensions and proofs, was 
omitted. 

The changes made in chapters 9-11 may be summarized as follows. A distinction is made 
between primitive knowledge and advanced knowledge, and the problems of induction and 
of the meaning of limit statements are given different treatments for the two cases. In 
advanced knowledge, which presupposes the use of previous inductions, both the inductive 
inference and the meaning of limit statements can be accounted for in terms of probability 
concepts. In primitive knowledge, which does not make use of previous inductions, the 
meaning of limit statements can be defined by the help of a “finitization”, which
eliminates infinite sequences; the use of infinite sequences then appears as a simplification
introduced for technical reasons. The inductive inference, as far as it is applied in primitive 
knowledge, is made legitimate by my justification of induction. 

A further distinction is made between the object-language interpretation and the
metalanguage interpretation of probability. The latter is used for the construction of probability
logic and is thus also called the logical interpretation. At the time the German book was 
written, little was known about the significance of this distinction. The omission of the 
distinction in the German book had no bearing upon my theory of induction or on the 
mathematical parts of the book; the truth tables of probability logic given in the German 
edition were not affected by the omission either. But since the distinction has turned out 
to be relevant from many logical aspects, it has been carried through in the new presentation. 
The method of derivation in probability logic was revised and greatly elaborated. 

The exposition of inductive methods in their various forms was widened, and a section 
on the inductive introduction of all-statements and the probability of hypotheses was 
added. Recent publications on the theory of statistical estimation were included in the 
frame of the discussion. 

INTRODUCTORY CONSIDERATIONS


§ 1. The Probability Concept of the Language of 
Everyday Life 

The word “probable” is frequently used in everyday language; more often, 
however, the concept is employed without being explicitly expressed. We 
must restrict to mere probability not only statements of comparatively great 
uncertainty, like predictions about the weather, where we would cautiously 
add the word “probable”, but also statements of so high a degree of
probability that we do not consider it necessary to mention the unavoidable
uncertainty, or statements of the probability character of which we are not even 
conscious. Sometimes we express the uncertainty by a gesture or the
accentuation of words. If we expect the plumber to come for some repairs we may
communicate the news to the family by a shrug of the shoulders. But in 
many instances even this symbolic gesture is missing. Thus, when we go to 
the station to catch a certain train, it does not occur to us that because of 
an accident the train might for once be late. Even a desultory consideration 
of the statements of daily life shows clearly that a great number of them owe 
their character of “certainty” to a confusion of certainty with a high degree 
of probability. On close inspection, finally, it becomes evident that there are 
no statements of absolute certainty, if the statements are not to designate 
empty logical relations but to assert the existence of specific facts. 

Incidentally, it would be a mistake to believe that the concept of
probability concerns only statements about the future. In many statements about
the past, we evidently use the concept of probability. The historian considers 
it very probable that Nero ordered the burning of Rome; he believes it less 
probable that Henrietta of England, who lived at the court of Louis XIV 
as the Duchess of Orleans, was murdered; he regards it as improbable that 
Bacon was the author of Shakespeare’s plays. Even events of the past can 
only be asserted as probable. 

Although we apply the concept of probability in daily life as a matter of 
course, we find it difficult to say what we mean by the concept “probable”. 
We know that a probability statement neither asserts nor denies the facts 
that it designates, but we do not restrict it to an assertion of the mere
possibility of an event, since we make distinctions in the degree of probability.
For instance, we regard Nero’s responsibility for the burning of Rome as 
more probable than Bacon’s authorship of Shakespeare’s plays. But we have 
only a vague notion of the applicability of the concept without being able to 
explain its meaning. Since, furthermore, probability statements appear to
be grounded in the insufficiency of human knowledge, they seem to be more
or less subjective. It is impossible to know with certainty who the Man in 
the Iron Mask was; his identity, however, is an objective fact, and if those 
who knew his origin had left a trustworthy account, they would have spared 
us the uncertainty. To know how the weather will be tomorrow is not possible 
for us; but we hope that future meteorology will predict the weather of the 
next day with certainty, or at least with the same certainty that the arrival 
of trains is predicted today. 

What can be the significance, for philosophical investigation, of a concept 
whose interpretation is vague and whose origin seems to be rooted in the 
inadequacy of human knowledge? The analysis of the probability concept of 
everyday language has, indeed, been rather fruitless for philosophical
investigation. Its inefficiency has manifested itself in the philosophical critique of
the probability concept carried through in traditional philosophy. Philosophers 
have been satisfied to construe probability as an uncertainty originating in 
the imperfection of human knowledge, and to connect the concept of
probability with that of possibility; this was virtually all that philosophy could
discover so long as it restricted its studies to the probability concept of
everyday language. Thus the first line of development of a theory of probability,
which, incidentally, goes back to Aristotle, did not supply any significant
results. 

Philosophers even tried to eliminate the concept of probability from science 
and to restrict it to prescientific language—which was an evasion rather than 
a solution of the problem. It cannot be admitted that everyday language 
uses concepts essentially different from those of scientific language. The 
analysis of science has shown that there is no sharp borderline between 
scientific and prescientific statements, that the concepts of daily life are 
absorbed by the language of science, in which they take on a more concise 
form and a clearer content without being abandoned. Thus it has become 
evident that criticism of fundamental concepts in their scientific formulation 
is more fruitful than reflection about their naive usage; that scientific
expression, through its precise wording, leads to a clearer interpretation of concepts
and a deeper understanding of their meanings. 

It may be recalled that the analysis of the concepts of time and space 
remained futile so long as the philosophical discussion did not extend beyond 
the use of these concepts in daily life. Only with the elaboration of the
scientific theory of space and time, carried through in non-Euclidean geometry
and the theory of relativity, did it become possible to uncover the ultimate 
nature of space-time concepts and to achieve a more profound understanding 
of their application to daily life. Thus we now have a better knowledge of
what an architect means when he specifies lengths and widths in the plan 
of a building, and what a watchmaker does when he synchronizes a number 
of watches. Another example is logic, the fundamentals of which, though
continuously applied in everyday language, remained unclarified until
mathematicians undertook the analysis and formulation of logical relations. The
probability concept, therefore, can be studied successfully only within the 
realm of its scientific application. 

From this point of view a second line of development of the probability 
concept—its evolution within the exact sciences—may be traced. 

§ 2. The Historical Development of the Scientific Concept
of Probability

The scientific evolution of the probability concept, which began with the 
construction of the mathematical theory of games of chance about the 
middle of the seventeenth century, is a striking example of the materialistic 
origin of intellectual developments. Wealthy noblemen, who spent their ample 
hours of leisure in the excitement of gambling, supplied the stimulus that 
induced ingenious mathematicians to construct the mathematical theory of 
probability. The thrill of complication had made the rules of games of chance 
so involved that some farsighted knights of the green table turned to eminent 
mathematicians like Pascal and Fermat for an exact calculation of all chances. 
The mathematicians undertook the task with increasing interest as they 
discovered the consistent logical structure of a mathematical theory. It is 
surprising that even Jacob Bernoulli developed his profound mathematical 
theorem in the pursuit of a detailed mathematical calculation of a variety 
of games of chance. 

If we compare the status of knowledge of the concept of probability as 
obtained from the theory of games of chance with the status resulting from 
the discussion of the probability concept of everyday language, we notice 
remarkable progress. Only in one respect did the analysis of probability 
concepts remain on the naive level: The idea that human ignorance is the 
source of the concept of probability was taken over by mathematicians, who 
recognized that probability statements about the die or the roulette are 
used only because it is impossible to portray the individual cast of the die 
or the turning of the wheel by mathematical laws. But one new insight 
was opened by games of chance: the discovery that probability statements 
can be transformed into statements of a high degree of certainty if they 
are applied to a great number of homogeneous cases, that is, if they are 
transformed into statistical statements. The relation between probability 
and frequency was made clear for the first time in the mathematical study 
of games of chance. The implications of this epistemological discovery will 
be discussed in a later chapter. 

About a decade after the fundamental inquiries of Pascal and Fermat 
into the theory of games of chance, the calculus of probability was first
applied to social statistics. The problems of life insurance were investigated, 
for at that time many of the middle class wished to secure a lifelong income 
with a single outlay of capital. The calculus of probability was applied, also, 
to statistics on the efficacy of medical treatments, the reliability of testimony 
in courts, and so on. The rationalism of the eighteenth century embraced 
such ideas with fervor. By the application of such rational methods to all 
the questions of daily life, Condorcet hoped to achieve the ideal that “our 
reason cease to be the slave of our impressions”. 

In continuation of these developments, the calculus of probability was 
applied to the mathematical sciences, particularly to astronomy and geodesy, 
where the theory of errors presented statistical questions. The construction 
of suitable methods, which is connected primarily with the names of Laplace 
and Gauss, bears witness to mathematical genius; it linked the calculus of 
probability to methods of mathematical analysis. What is more important, 
epistemologically speaking, is the fact that for the first time the probability 
concept found an application in the exact sciences. The study of games of 
chance had helped to clarify the meaning of probability. They became models 
by which the theorems of probability were explained. Even today, games of 
chance are significant in the mathematical calculus of probability because 
they represent easily understandable applications of probability laws. In 
social statistics and in the theory of errors, however, the field of application 
itself became the subject of scientific interest. 

The indispensability of the probability concept for the natural sciences 
became even more apparent when a new field of application was opened— 
the kinetic theory of gases and liquids. Whereas in the theory of errors a 
higher precision of observational results, an improvement in the numerical 
aspect of physical knowledge, had been achieved, the appearance of the
probability concept in the kinetic theory meant nothing less than its penetration
into the concept of natural law, for which new perspectives were suddenly 
revealed. The statistical gas theory asserted that certain laws that had
formerly been considered to be strict physical laws were statistical laws, that is,
laws of a probability character. This holds, in particular, for Boltzmann’s 
interpretation of the second principle of thermodynamics, the epistemological 
implications of which were not completely understood until our day.
Boltzmann’s theory implied that certain laws that had previously been regarded
as strict laws of nature are not different from the statistical laws of games 
of chance, and that the law of great numbers, which was uncovered in the 
theory of games of chance, represents a general type of physical law. Thus 
the concept of probability was related to that of causality, and the concept 
of the statistical law of nature took its place beside that of the causal law 
of nature. 
The fundamental significance of the probability concept was not generally
acknowledged even then. Its application to the gas theory was regarded 
by many scholars as an expedient necessitated by human ignorance, by the 
impossibility of following the movements of individual gas molecules with 
the methods of physics. The adherents of this conception upheld the postulate 
that the events of the microcosm are controlled by strict causal laws, and they 
looked upon the laws of probability as a crude reconstruction of the
macrocosm to which we resort because of the insufficiency of human observations
and calculations. A similar conception was customary in social statistics, in 
which it seemed obvious that statistical laws are the product of an omission 
of causal considerations that in principle can be carried through for each 
individual case separately. Even in Boltzmann’s gas theory, therefore, the 
ascendancy was gained by an interpretation that belittled the significance 
of the probability concept and from the very outset hindered insight into 
the epistemological position of the concept. 

At the same time, however, there developed a second conception that 
granted much greater significance to the concept of probability. Since certain 
laws, formerly regarded as strict causal relations, had been revealed as
probability laws, it seemed possible that the same fate awaited all the so-called
strict laws of physics. According to this conception, the causal interpretation 
of nature represents a rather crude form of description that is necessitated 
by the inaccuracy of human observations and is made possible by the
concerted action of many elementary processes in macrocosmic phenomena.
Whereas the first conception asserts the primacy of the causality concept 
over the probability concept, the second conversely maintains the primacy 
of the probability concept over the causality concept; the microcosm seems 
to be governed only by probability laws, whereas for the macrocosm there 
result statistical regularities that we take for causal laws and from which 
the idea of a strict causal determination has been incorrectly extrapolated. 

By these considerations the theory of the probability concept was connected 
with a third line of development, the problem of causality, which was to
join the other two in giving the concept of probability a leading position. 
Their merging, however, did not occur until our day. 

The idea of the strict causal connection of all natural phenomena is justly 
called a product of Western physical science, for it was the science of the 
modern age that, through the consistent application of the principle of
causality, especially in mathematical physics, brought the principle to the fore
and made it the basic principle of a knowledge of nature. The philosophical 
criticism of the idea of causality could be attempted only after the principle 
had been sufficiently elaborated. 

The decisive turn was made in the criticism of David Hume, to whom we 
owe the greatest discoveries in regard to the logical structure of the causality
concept. Hume recognized that the causality relation establishes a mere
coordination of events and that all metaphysical ideas of an intrinsic
connection of events constitute anthropomorphisms having no objective meaning.
It is only a relation of the form if-then, in the meaning of a concurrence
without exception, that is asserted in the laws of nature. Bacon had
emphasized that this relation is established by means of inductive inference. Hume,
however, saw that the peculiar structure of inductive inference calls for 
serious criticism. He considered the inference in its simplest form—induction 
by enumeration: observing repeatedly that an event A is accompanied by an 
event B, we infer that the concurrence will always take place. Although he 
saw that there are differences between scientific manipulation and everyday 
life, and that the scientist insists upon precise analysis of all factors involved 
in a phenomenon before he applies the inference, Hume realized clearly that 
such qualifications cannot eliminate the inductive inference from science and 
that all scientific inferences ultimately presuppose the legitimacy of induction 
by enumeration. Nor does the method of scientific experiment change the 
situation. Through experiment the scientist produces conditions that supply 
particularly instructive trains of events. That, however, under the same 
conditions the same thing will always happen is an indispensable assumption, 
in which inductive inference is applied. 

I do not intend to inquire now into the strange and enigmatic nature of 
inductive inference, which will be considered explicitly in later sections of 
the book. I only wish to point out that Hume’s insight into the problematic 
nature of inductive inference opened the path for the connection of causal 
with probability methods. The relation of inductive inference to probability 
is obvious, for we never claim that the inductive conclusion is certain; to 
what extent it can at least be called probable is a question I shall attempt to 
answer in the later analysis. 

Strangely enough, Hume’s early contribution toward a connection of 
causality and probability was forgotten in later philosophical developments. 
It was primarily Kant’s unfortunate doctrine of the apriority of causality 
that misdirected subsequent analysis of the problem. Although Kant,
according to his own words, “was awakened from his dogmatical slumber” by
Hume’s criticism, he was unable to do justice to Hume’s questioning of the 
legitimacy of inductive inference; and, in fact, he never attempted to apply 
his general theory of the a priori to a specific solution of Hume’s problem. 
Otherwise he would have seen that even the assumption of the a priori 
validity of the principle of causality cannot make inductive inferences
dispensable for the discovery of individual causal relations. If, for instance, after
repeated observation of the deviation of the magnetic needle by an electric 
current, we proceed to the assertion that always and everywhere the electric 
current will have the same effect on magnetic needles, it does not help our
inductive inference to know that there are causal relations between the
determining factors of physical occurrences. The statement that the observed 
relation is generally valid, i.e., that we already have sufficient knowledge 
of the determining factors of the problem, still depends on inductive inference. 

Later, when the transition from thermodynamics to a statistics of gases 
was completed and the statistical law had found its place beside the causal 
law as a second type of natural law, critical investigations of the causality 
concept were connected with the analysis of the probability concept. These 
considerations originated in an inquiry into the peculiar relation existing 
between natural law and reality. The law never portrays the actual occurrence 
completely, but represents an idealization in which only certain prominent 
factors are considered, whereas an infinite number of other factors are
neglected. Without such a schematization, natural events would be too complex
for interpretation. 

This method, however, gives rise to a peculiar problem: we calculate the 
expected effect on the assumption of certain ideal conditions, although we 
know that the real conditions do not correspond completely to the ideal 
ones. The problem finds its solution in the discovery that an application of 
natural laws to reality is never expressed in certainty statements, but only 
in probability statements. It is through the coordination of the ideal structure 
to reality that the probability concept enters into the physical sciences. 
The existence of ideal conditions, or even of approximately ideal ones, cannot 
be asserted with certainty, but only with probability, though the probability
may be rather high if a sufficient tolerance is admitted for the conditions. 
The expected effect can, therefore, be predicted only with probability. It is 
possible to improve the probability of a prediction by a more precise analysis 
of the conditions, but we can never rid ourselves of probability statements. 

We might attempt the assumption that the probability of a prediction 
will approach certainty indefinitely with a more exact analysis of the
determining factors; this assumption, in fact, represents the strict form in which
the hypothesis of causality is to be expressed. The very formulation reveals 
that the assumption cannot be regarded as a priori necessary, but that it 
is a matter of experience whether an increase in the probability of predictions 
is possible. It may be that the contrary is true, that the increase in the
probability of predictions is restricted to remain below a certain limit that is
lower than certainty. Such a restriction would represent the transition to a 
more general form of natural law. This generalization, which I developed in 
earlier papers 1 in the sense of a possibility, was later recognized as an actuality 
by quantum mechanics, and was formulated in Heisenberg's principle of 
indeterminacy. 

1 H. Reichenbach, “Die Kausalstruktur der Welt und der Unterschied von Vergangenheit
und Zukunft,” in Ber. d. bayer. Akad., math.-phys. Kl., 1925, p. 138.

Thus the historical development of physics led to the result that the
probability concept is fundamental in all statements about reality. Strictly
speaking, we cannot make a single statement about reality the validity of 
which can be asserted with more than probability. Only a theory of the 
probability concept, therefore, can supply an exhaustive analysis of the 
structure of statements about reality. This is why the theory of probability 
stands today in the focus of investigations that, within the frame of a
scientific philosophy, are concerned with clarification of the nature of knowledge.

§ 3. Remarks about the Plan of the Book 

From the preceding historical account it will be evident to the scientific 
philosopher that a satisfactory analysis of the probability concept can be 
carried through only in connection with a study of the probability concept 
as it developed in the mathematical calculus of probability. Only in the 
mathematical theory did the concept acquire the precise formulation that 
reveals its logical structure. The concepts that are applied in the process 
of knowledge are not always understood by the investigator from the
beginning; only through continuous application do they assume clearer meanings
until they reach a stage of determinateness that makes philosophical analysis 
fruitful. 

This book is not meant to analyze the connection of the probability concept 
with the problem of causality, but will be restricted to a treatment of the 
mathematical calculus of probability. Nonetheless, all the results may be 
transferred to every application of the probability concept in science or daily 
life. In spite of the derivation of ingenious theorems, the calculus of
probability as it has been formulated by mathematicians does not, however,
possess the degree of logical strictness and clarity that is necessary for
philosophical analysis. It is my intention in this work to present a construction
of the calculus of probability that is mathematically as well as logically 
satisfactory, and then, returning to the logical and epistemological problems, 
to show that all the questions on the nature and application of the probability 
concept can be answered satisfactorily. 

The solution here presented for the problem of the application of the
probability concept to physical reality will look very different from the traditional
conceptions. It will be shown that the analysis of probability statements 
referring to physical reality leads into an extension of logic, into a probability
logic, that can be constructed by a transcription of the calculus of probability;
and that statements that are merely probable cannot be regarded as
assertions in the sense of classical logic, but occupy an essentially different logical
position. Within this wider frame, transcending that of traditional logic, 
the problem of probability finds its ultimate solution.

The results of this inquiry demand a readjustment of traditional
conceptions of the foundations of knowledge. It is not surprising, however, that the
analysis of probability should lead to far-reaching results. The concept of 
probability is not the instrument of some narrow scientific discipline; it is 
the fundamental concept on which all knowledge of nature is based, the 
interpretation of which determines the formulation of any theory of
knowledge. Hence there is good reason for attributing so much significance to the
analysis of probability. What is sought is not only an interpretation of a 
concept of mathematics and mathematical physics but the interpretation 
of the knowledge of nature, the answer to the question of the ultimate meaning 
of statements about the physical world. 

The construction of the calculus of probability presented in this book has 
the form of an axiomatic system. Some assumptions are assembled in the 
beginning and employed as axioms, and all the other theorems of the calculus 
are derived from them. This procedure has a mathematical advantage in 
that it presents the content of the mathematical discipline of probability 
in a logically ordered form. Moreover, it is of great help to a philosophical 
analysis because it exhibits clearly the logical problems of the theory and 
compels one to formulate exhaustively all presuppositions. 

Expositions given in textbooks of probability are often written with the 
intention of making the theorems of the calculus plausible to the reader. 
Certain presuppositions, therefore, are not mentioned but are regarded as 
“understood”; and it is assumed that this kind of presentation is the safest 
means of instilling belief in the theorems of probability. The method cannot 
be said to clarify the reasons why the theorems should be accepted, since 
the philosophical problems are usually contained in the presuppositions that 
are “understood”. If an assumption is treated as a matter of course, one 
should always suspect that it serves to conceal certain philosophical problems. 
By uncovering hidden assumptions, the axiomatic procedure accomplishes 
something philosophical: it raises unconscious laws of reasoning to the level 
of conscious assumptions. Formulation, therefore, is an important means of 
philosophical analysis; mathematical presentation becomes a tool of the 
philosopher when it is directed toward exhaustive formulation. 

Mathematicians have developed a method that entails the most precise 
formulation, namely, formalization. Formalization is the introduction of a 
set of symbols that permit us to abandon conversational language entirely 
and to express thought relations by relations of algorismic symbols. The 
value of the method consists in its avoidance of a language whose structure, 
originating from everyday needs, cannot express the logic of philosophical 
problems. Even if the manipulation of the algorism offers difficulties to the 
untrained student, his mental effort will be repaid by a clearer understanding 
of the subject and an abbreviation of the process of subsequent learning.

An algorism of this kind will be developed for the exposition of the
mathematical calculus of probability. The task may be simplified greatly by the
adoption of the algorism of symbolic logic, with the addition of a sign for 
probability. The algorism presents the fundamentals of the calculus of
probability in a clear and instructive way; it is expedient, too, for the formulation
of complicated theorems and for practical applications. 

The usefulness of a symbolism that goes beyond that of mathematics is 
not widely recognized; in particular, mathematicians have not always been 
on friendly terms with logical symbolism. The student of probability,
therefore, may not be familiar with the symbolic technique. For this reason the
following chapter offers a short introduction to symbolic logic. The
presentation is restricted to the parts of the technique that are used in the construction
of the probability calculus given in this book. If the brevity of the exposition 
leaves unanswered questions in the mind of the reader, he may consult an 
exhaustive presentation on symbolic logic by the author. 1 

The logic presented in chapter 2 is called deductive logic. Its theorems are 
necessarily true; and if its methods of derivation are applied to true
statements, the resulting conclusions will be true. This part of logic is to be
distinguished from inductive logic, which cannot guarantee the truth of its
conclusions, and which will be presented in chapter 11. 

1 H. Reichenbach, Elements of Symbolic Logic (New York, 1947). Hereafter referred to
as ESL.

§ 4. The Calculus of Propositions

In its first part, the calculus of propositions, symbolic logic treats operations 
performed with propositions. A proposition is a sign combination that is 
either true or false. The terms “sentence” and “statement” will be regarded 
as synonymous with “proposition”. Examples are given by the propositions: 
“Berlin is situated on the Spree”; “Two times two is four”; “If potassium 
is thrown into water, it begins to burn”; “Gold is lighter than water”. The
last sentence is, certainly, a false proposition, but that does not deprive it
of its propositional character. Symbols arranged in a meaningless
combination, however, do not form a proposition. If someone says, “Light is a prime
number”, or “Two times two equals and”, he does not make a statement,
not even a false one; he only combines symbols without constructing a
meaningful combination. The meaningless combination must therefore be
distinguished, as a third category, from true or false sign combinations; only
the last two sign combinations are propositions.

Propositions are denoted by variables $a$, $b$, .... The most important
operations with propositions, namely, propositional operations, are

    $\bar{a}$          non-a                   (negation)
    $a \lor b$         a or b                  (disjunction, logical sum)
    $a \cdot b$        a and b                 (conjunction, logical product)
    $a \supset b$      a implies b             (implication)
    $a \equiv b$       a is equivalent to b    (equivalence, logical equality)

By means of a propositional operation, a compound statement is constructed
from the elementary statements $a$, $b$, .... Thus by the operation of
conjunction the compound statement $a \cdot b$ is constructed from the
elementary statements $a$ and $b$.

The terms “and” and “or” will be used in the description; such use is not 
circular, however, since it is not intended to define the terms, i.e., to replace 
them by others. They will be considered as primitive concepts the meaning 
of which is known. Explaining their meaning more precisely is not a definition, 
but a characterization of the propositional operations. 

The statement $\bar{a}$ is true if $a$ is false.

The statement $a \lor b$ is true if $a$ is true and $b$ is false, or if $a$ is false and $b$ is
true, or if $a$ and $b$ are true. This operation represents the inclusive “or”,
not the exclusive “or”, which conversational language expresses by
“either-or”, and which excludes the case that $a$ and $b$ are true. For the exclusive “or”
the symbol $a \land b$ will be used, but not often, since it can be replaced by
other symbols, for instance, by the combination

    $(a \lor b) \cdot \overline{a \cdot b}$    (1)
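
As an illustration only, the replacement of the exclusive “or” by combination (1), read as “(a or b) and not (a and b)”, can be checked over the four combinations of truth values. The following Python sketch is not part of the original text; the function names are hypothetical.

```python
from itertools import product


def xor_direct(a: bool, b: bool) -> bool:
    """Exclusive "or": true when exactly one of a, b is true."""
    return a != b


def xor_by_formula(a: bool, b: bool) -> bool:
    """Combination (1): (a or b) and not (a and b)."""
    return (a or b) and not (a and b)


# Case analysis over the four combinations of truth values.
agree = all(
    xor_direct(a, b) == xor_by_formula(a, b)
    for a, b in product([True, False], repeat=2)
)
print(agree)
```

The two functions differ from the inclusive “or” only in the case where both a and b are true.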

The inclusive “or” does not state that the case a and b must hold; it states 
only that this case may also hold. The name “logical sum” is used for the 
inclusive “or” to indicate a certain correspondence of the or-sign to the 
plus-sign of arithmetic. If, for example, things of a different kind are added, 
the or-connection of the respective concepts corresponds to the addition of 
the numbers. Thus 8 male persons plus 6 female persons equals 14 persons— 
this equation is true because the term “person” is the or-combination of the 
terms “male person” and “female person”. 

The statement a.b is true if both a and b are true. The name “logical 
product” used for this operation indicates a certain correspondence of the 
and-sign to the multiplication sign of arithmetic. If the numerical values 
belonging to concepts of different kinds are multiplied, the concepts enter 
into an and-combination. Thus 3 meters times 4 kilograms equals 12 meter 
kilograms. The concept “meter kilogram” may be conceived as the logical 
pair of the concepts “meter” and “kilogram”, that is, it denotes an object 
that is characterized by the concepts “meter” and “kilogram”. 

The statement $a \supset b$ is true if $a$ is true and $b$ is true, or if $a$ is false and $b$ is
true, or if $a$ is false and $b$ is false. This characterization of the implication,
or if-then statement, which will later be discussed in detail, may seem unusual,
but it corresponds to linguistic usage so far as it excludes only the second of
the four possible combinations

    $a \cdot b$    $a \cdot \bar{b}$    $\bar{a} \cdot b$    $\bar{a} \cdot \bar{b}$    (2)

An implication is therefore regarded as true if one of the three other
combinations holds. In the expression $a \supset b$, the term $a$ is called the implicans
and $b$ the implicate.

The statement $a \equiv b$ is true if $a$ is true and $b$ is true, or if $a$ is false and $b$ is
false. The equivalence establishes the same truth value for $a$ and $b$. The
equivalence relation plays a role in logic that resembles the relation of equality
in arithmetic and is therefore sometimes called a logical equality. It is used
in logical formulas to state that certain symbol combinations are logically
equivalent, that is, have the same truth value. Assertions of this kind
represent a great part of the logical formulas that are derived in the calculus of
propositions, though not every logical formula is an equivalence. It is easily
seen that an equivalence means the same as implications in both directions,
that is, $a \equiv b$ holds if both $a \supset b$ and $b \supset a$ hold (formula 7a). Whereas the
first implication excludes the second of the four combinations (2), the second
implication excludes the third combination, so that only the first and the
fourth remain, corresponding to the foregoing characterization of the
equivalence.

The given characterizations of the propositional operations are expressed 
in the truth table 1, in which the truth value $V(a)$ of a statement $a$, which
is either truth or falsehood, is denoted, respectively, by the letters T and F. 


TABLE 1

A. Negation

    $V(a)$    $V(\bar{a})$
    T         F
    F         T

B. Sum, Product, Implication, Equivalence

    $V(a)$    $V(b)$    $V(a \lor b)$    $V(a \cdot b)$    $V(a \supset b)$    $V(a \equiv b)$
    T         T         T                T                 T                   T
    T         F         T                F                 F                   F
    F         T         T                F                 T                   F
    F         F         F                F                 T                   T

Table 1A refers to the monadic operation of negation; the argument column
(left) contains the truth values T and F of the elementary proposition $a$;
the functional column (right), the truth values of the compound proposition $\bar{a}$.
Table 1B refers to the binary operations, which connect two propositions; the
table has two argument columns (left) and determines, in each of the
functional columns, the truth values T or F of the respective compound
proposition.
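
The functional columns of the table for the binary operations can be regenerated mechanically. The following Python sketch is an illustration, not part of the original text: the dictionary `OPERATIONS` and the helper names are hypothetical, and Python's `True`/`False` stand in for T and F.

```python
from itertools import product

# Truth-functional operations of the binary table, in the table's column order.
OPERATIONS = {
    "a or b": lambda a, b: a or b,              # logical sum
    "a and b": lambda a, b: a and b,            # logical product
    "a implies b": lambda a, b: (not a) or b,   # implication
    "a equiv b": lambda a, b: a == b,           # equivalence
}


def truth_table():
    """Return one row per argument pair: (a, b, sum, product, implication, equivalence)."""
    rows = []
    for a, b in product([True, False], repeat=2):
        rows.append((a, b) + tuple(op(a, b) for op in OPERATIONS.values()))
    return rows


def letter(value: bool) -> str:
    """Render a truth value with the letters T and F."""
    return "T" if value else "F"


for row in truth_table():
    print(" ".join(letter(v) for v in row))
```

Each printed line corresponds to one line of the binary table, with the two argument columns first.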

The truth tables can be read in two directions. When read from right to
left, they state, for a compound statement given as true, the possible
combinations of elementary propositions. Thus if $a \lor b$ is true, table 1B furnishes
the result that $a$ is true and $b$ is true, or $a$ is true and $b$ is false, or $a$ is false
and $b$ is true. When read from left to right, the tables state, for given
elementary propositions, the truth value of the compound proposition. For instance,
if $a$ is true and $b$ is true, table 1B states that $a \lor b$ is true. This distinction of
directions leads to two possible interpretations of the truth tables. If only
the direction from right to left is used, we speak of the connective
interpretation; if both directions are used, we speak of the adjunctive interpretation.
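
The two directions of reading can be pictured as two different lookups on the same table. The sketch below is an illustration under that assumption; the names `t_cases` and `value` are hypothetical and do not occur in the original text.

```python
from itertools import product


def t_cases(operation):
    """Right-to-left reading: the combinations (a, b) for which the compound is true."""
    return [(a, b) for a, b in product([True, False], repeat=2) if operation(a, b)]


def value(operation, a: bool, b: bool) -> bool:
    """Left-to-right reading: the truth value of the compound for given a and b."""
    return operation(a, b)


def inclusive_or(a: bool, b: bool) -> bool:
    return a or b


print(t_cases(inclusive_or))
print(value(inclusive_or, True, True))
```

On this picture the connective interpretation admits only `t_cases`, whereas the adjunctive interpretation admits both functions.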

In the adjunctive interpretation, the truth values of the elementary 
propositions determine the truth value of the compound proposition. Since 
in this relation no reference is made to the meaning of the propositions, 
the propositional operations of the adjunctive interpretation are called
truth-functional.

In the connective interpretation no such determination is possible. Thus
if both $a$ and $b$ are known to be true, this interpretation admits of no
statement whether $a \lor b$ is true; the answer to the question depends on further
information. However, if $a \lor b$ is known to be true, this information is
regarded as sufficient for the statement of the disjunction of the T-cases of the
compound statement, in the example, for the statement that $a$ is true and
$b$ is true, or $a$ is true and $b$ is false, or $a$ is false and $b$ is true.

An example of an adjunctive “or” is the statement, “The speech of the
prime minister will be transmitted by station KFI or station KNX”. If we
observe that station KFI transmits the speech, the statement is regarded as
verified. An illustration of a connective “or” is the statement, “A man
suffering from severe diabetes takes insulin injections or will soon die”. We
do not regard this statement as verified if we see a man taking insulin
injections; the totality of knowledge about the nature of diabetes is involved in
the statement. However, the disjunction of the T-cases is employed in this
instance; of the combinations (2) the statement excludes only the last
combination. Since the meaning of the propositional operations depends on the
interpretation, a distinction is made between adjunctive operations and
connective operations.

Conversational language employs both kinds of propositional operations. 
The two examples cited illustrate the two meanings of the inclusive “or”; 
the exclusive “or”, too, is used in both meanings. The “and” is used almost 
exclusively in the adjunctive sense. The implication, however, is usually 
meant in the connective sense. Thus we do not regard it as a verification of 
the statement, “If you take this medicine your cold will disappear”, when 
the medicine is taken and the cold disappears. Such a coincidence, observed 
in only one instance, might be due to chance. We would regard it as even 
less a verification if the medicine were not taken and the cold disappeared,
a combination that represents one of the T-cases of the implication.

The adjunctive implication rarely occurs in conversational language. 
Moreover, it leads to peculiar consequences that have been called the
paradoxes of implication. Every false sentence implies adjunctively any sentence,
and every true sentence is adjunctively implied by any sentence. For instance,
the sentence, “The earth is a flat disk implies the earth does not revolve
around the sun”, is true in the adjunctive sense of the word “implies” because
the implicans is false; and the sentence, “The earth is a flat disk implies the
earth revolves around the sun”, is also true, for the same reason. The
implication, “Sugar is sweet implies water is wet”, is true because the implicate
is true. These examples lose their paradoxical character if we realize that the
word “implies” is used in a meaning different from the one accepted for
conversational language. The adjunctive implication does not connect
meanings, but simply adjoins two propositions according to rules referring to their
truth values. Similarly, the equivalence of conversational language is usually 
connective. 
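
The so-called paradoxes can be reproduced directly from the truth table of the adjunctive implication. The sketch below is an illustration only; the variable names are hypothetical, and the truth values assigned to the example sentences are those assumed in the text.

```python
def implies(a: bool, b: bool) -> bool:
    """Adjunctive implication: false only when a is true and b is false."""
    return (not a) or b


# Truth values assumed for the elementary sentences of the examples.
earth_is_flat = False
earth_revolves_around_sun = True
sugar_is_sweet = True
water_is_wet = True

# A false implicans implies any sentence, whether true or false.
flat_implies_not_revolves = implies(earth_is_flat, not earth_revolves_around_sun)
flat_implies_revolves = implies(earth_is_flat, earth_revolves_around_sun)

# A true implicate is implied by any sentence.
sugar_implies_wet = implies(sugar_is_sweet, water_is_wet)

print(flat_implies_not_revolves, flat_implies_revolves, sugar_implies_wet)
```

No reference to the meaning of the sentences enters the computation, which is what distinguishes the adjunctive from the connective implication.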

Symbolic logic employs adjunctive operations throughout. In doing so it 
makes use of the scientist’s privilege of defining his own simplified concepts. 
Such an attitude does not mean that the definition of connective operations 
is regarded as unnecessary; it is only postponed to a later stage. It is possible 
to construct connective operations from adjunctive ones, though such a 
definition is rather involved. 1 

The four binary operations of table 1B are not the only possible
operations. This can be seen from the fact that the functional columns
contain only certain arrangements of the values T and F. Any other possible
arrangement of the values T and F would also define an operation. There
are $2^4 = 16$ possible arrangements all together, and thus 16 binary operations
can be defined. One of them is the exclusive “or”, which is distinguished
from the inclusive “or” in that the first line of its column contains an F
instead of a T. The operations of table 1B, however, are sufficient for all
purposes, since the meaning of other operations can be constructed through
a combination of these symbols.
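
The count of 16 follows because a functional column has four lines and each line may hold T or F. A short Python sketch, offered only as an illustration with hypothetical names, enumerates the columns and compares the inclusive with the exclusive “or”.

```python
from itertools import product

# The four argument lines of a binary truth table: TT, TF, FT, FF.
ARGUMENTS = list(product([True, False], repeat=2))

# Every assignment of T or F to the four lines is a possible functional column.
ALL_COLUMNS = list(product([True, False], repeat=len(ARGUMENTS)))


def column(operation):
    """The functional column of an operation, listed in the order of ARGUMENTS."""
    return tuple(operation(a, b) for a, b in ARGUMENTS)


inclusive_or = column(lambda a, b: a or b)
exclusive_or = column(lambda a, b: a != b)

print(len(ALL_COLUMNS))
print(inclusive_or, exclusive_or)
```

The two columns agree in every line except the first, where the inclusive “or” has T and the exclusive “or” has F.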

We do not even need all the four operations of table 1B, for some are
reducible to others. Thus the statement $\bar{a} \lor b$ has the same arrangement of
T's and F's in its column as the statement $a \supset b$; therefore, the latter
statement may be defined by the first, and written

    $a \supset b =_{Df} \bar{a} \lor b$    (3)

The sign $=_{Df}$ expresses equality by definition. On its left is the new symbol,
or definiendum; on its right, a certain combination of old symbols, called
definiens. An expression like (3) is called a definition. A definition is often
used to replace a long symbol combination by a short new symbol, which is
called an abbreviation.

1 See ESL, §§ 7, 9, and chap. viii. In the meaning of the term “adjunctive”, the term
“extensional” is sometimes used; but, since the meaning of “extensional” is not clearly defined,
the term will not be used here. See ibid., p. 31.

Among the 16 possible arrangements of T's and F's in a column are two
that must be discussed separately. The first contains a T in every line, and
thus represents a combination of elementary statements that is true no
matter what truth values the elementary statements have. The truth value
of the compound statement is thus independent of the truth values of the
elementary statements; it is always T. Such a combination is called a tautology,
or an analytic statement. An example of a tautology is given by the statement

    $a \supset b \equiv \bar{a} \lor b$    (4)

Its tautological character is proved by case analysis, that is, by assuming 
for a and b successively the values T and F, and then applying the truth
tables for each of the operations occurring in the formula. A T is then found 
for each of the four possible cases. 
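
Case analysis lends itself to mechanical execution. The following Python sketch, which is an illustration and not the author's procedure, checks the statement “a implies b is equivalent to non-a or b”; the implication is read off a lookup table so that the check is not circular. The names `IMPLICATION`, `is_tautology`, and `formula_4` are hypothetical.

```python
from itertools import product

# Functional column of the implication, copied from the truth table.
IMPLICATION = {
    (True, True): True,
    (True, False): False,
    (False, True): True,
    (False, False): True,
}


def is_tautology(formula, n_vars: int) -> bool:
    """Case analysis: the formula must yield T for every combination of truth values."""
    return all(formula(*values) for values in product([True, False], repeat=n_vars))


def formula_4(a: bool, b: bool) -> bool:
    """'a implies b' is equivalent to 'non-a or b'."""
    return IMPLICATION[(a, b)] == ((not a) or b)


print(is_tautology(formula_4, 2))
```

A synthetic statement such as “a or b” fails the same test, since it is false when both a and b are false.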

The construction of tautologies is the very objective of logic. Since a
tautology is true for all truth values of the elementary propositions of which it is
composed, it is necessarily true. Tautological character, therefore, expresses
logical necessity. All logical formulas are tautologies.

The second arrangement to be considered is that in which the column of
the truth table contains only the sign F. Such a formula is false for every
combination of the truth values of the elementary statements and is
therefore called a contradiction. It follows from the given characterization that a
contradiction is the negation of a tautology. By the negation of (4), for
example, we obtain the contradiction

    $\overline{a \supset b \equiv \bar{a} \lor b}$    (5)

This definition explains why a contradiction is called necessarily false. Thus,
to say that the formula

    $a \equiv \bar{a}$    (6)

is contradictory means that it is false no matter what truth value the
proposition $a$ has.

Statements that are neither analytic nor contradictory are called synthetic
statements. Though they do not reveal the truth values of their elementary
statements, they express a restricting condition for these truth values: the
synthetic statement $a \lor b$ excludes the case that both $a$ and $b$ are false. Thus
they inform us about physical objects and situations. A tautology, on the
contrary, does not exclude any case and thus does not inform us about
anything; it is therefore an empty statement. The term “empty” must be
carefully distinguished from the term “meaningless” used previously. A
meaningless symbol combination is neither true nor false, but the empty statement
of a tautology is true. Its negation, therefore, is not meaningless either; it is
false. Thus (6) is not meaningless but false.
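
The threefold division into analytic, contradictory, and synthetic statements, together with the cases a statement excludes, can be sketched in Python as follows. This is an illustration under the assumption that statements are given as truth functions; the names `classify` and `excluded_cases` are hypothetical.

```python
from itertools import product


def excluded_cases(formula, n_vars: int):
    """The combinations of truth values that the statement rules out."""
    return [v for v in product([True, False], repeat=n_vars) if not formula(*v)]


def classify(formula, n_vars: int) -> str:
    """Analytic if nothing is excluded, contradictory if everything is, else synthetic."""
    excluded = len(excluded_cases(formula, n_vars))
    if excluded == 0:
        return "analytic"
    if excluded == 2 ** n_vars:
        return "contradictory"
    return "synthetic"


print(classify(lambda a: a or not a, 1))      # a or non-a
print(classify(lambda a: a == (not a), 1))    # a equivalent to non-a
print(classify(lambda a, b: a or b, 2))       # a or b
print(excluded_cases(lambda a, b: a or b, 2))
```

An empty list of excluded cases corresponds to the empty statement of a tautology: it is true, but it restricts nothing.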

In the list that follows, a number of useful logical formulas are presented.
The proof of their tautological character can be given by case analysis with 
the help of the truth tables. 

A formula written on a separate line should be taken to mean that the
formula is asserted to be true. Thus a symbol is saved for assertion. In order
to avoid parentheses, a rule of binding force is introduced, expressed by the
following listing of operations:

    strongest binding force    $\cdot$    $\lor$    $\supset$    $\equiv$    weakest binding force

The “and” has the strongest binding force, the equivalence the weakest.
The bar of negation, which indicates the scope of the negation, connects like
parentheses. Thus $a \lor b \cdot c$ means the same as $a \lor (b \cdot c)$; that is why
parentheses are omitted in this formula, whereas they cannot be omitted in
$(a \lor b) \cdot c$.
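
The effect of the rule can be illustrated with Python, whose `and` happens to bind more strongly than its `or`, in analogy to the “and” and “or” of the rule above. The analogy is an assumption of this sketch and goes no further: Python has no implication sign, and its `==` binds more strongly than `and` and `or`, unlike the equivalence here. The variable names are hypothetical.

```python
from itertools import product

CASES = list(product([True, False], repeat=3))

# "a or b and c" is grouped as "a or (b and c)" without parentheses.
same_as_right_grouping = all(
    (a or b and c) == (a or (b and c)) for a, b, c in CASES
)

# The other grouping, "(a or b) and c", is a different statement:
# these are the cases in which the two groupings disagree.
disagreements = [
    (a, b, c) for a, b, c in CASES if (a or b and c) != ((a or b) and c)
]

print(same_as_right_grouping)
print(disagreements)
```

The disagreements occur exactly when a is true and c is false, which shows why the parentheses in “(a or b) and c” cannot be omitted.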

Tautologies in the Calculus of Propositions 


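CONCERNING ONE PROPOSITION

1a. $a \equiv a$    rule of identity
1b. $a \lor a \equiv a$    rule of identity
1c. $a . a \equiv a$    rule of identity
1d. $\bar{\bar{a}} \equiv a$    rule of double negation
1e. $a \lor \bar{a}$    tertium non datur
1f. $\overline{a . \bar{a}}$    rule of contradiction
1g. $a \supset \bar{a} \equiv \bar{a}$    reductio ad absurdum


SUM

2a. $a \lor b \equiv b \lor a$    commutativity of “or”
2b. $a \lor (b \lor c) \equiv (a \lor b) \lor c \equiv a \lor b \lor c$    associativity of “or”


PRODUCT

3a. $a . b \equiv b . a$    commutativity of “and”
3b. $a . (b . c) \equiv (a . b) . c \equiv a . b . c$    associativity of “and”


SUM AND PRODUCT

4a. $a . (b \lor c) \equiv a . b \lor a . c$    1st distributive rule
4b. $a \lor b . c \equiv (a \lor b) . (a \lor c)$    2d distributive rule
4c. $(a \lor b) . (c \lor d) \equiv a . c \lor b . c \lor a . d \lor b . d$    twofold distribution
4d. $a . b \lor c . d \equiv (a \lor c) . (b \lor c) . (a \lor d) . (b \lor d)$    twofold distribution
4e. $a . (a \lor b) \equiv a \lor a . b \equiv a$    redundance of a term


NEGATION, PRODUCT, SUM

5a. $\overline{a . b} \equiv \bar{a} \lor \bar{b}$    breaking of negation line
5b. $\overline{a \lor b} \equiv \bar{a} . \bar{b}$    breaking of negation line
5c. $a . (b \lor \bar{b}) \equiv a$    dropping of an always-true factor
5d. $a \lor b . \bar{b} \equiv a$    dropping of an always-false term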

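5e. $a \lor \bar{a} . b \equiv a \lor b$    redundance of a negation


IMPLICATION, NEGATION, PRODUCT, SUM

6a. $a \supset b \equiv \bar{a} \lor b$    dissolution of implication
6b. $a \supset b \equiv \overline{a . \bar{b}}$    dissolution of implication
6c. $a \supset b \equiv \bar{b} \supset \bar{a}$    contraposition
6d. $a \supset (b \supset c) \equiv b \supset (a \supset c) \equiv a . b \supset c$    symmetry of premises
6e. $(a \supset b) . (a \supset c) \equiv a \supset b . c$    merging of implications
6f. $(a \supset c) . (b \supset c) \equiv a \lor b \supset c$    merging of implications
6g. $(a \supset b) \lor (a \supset c) \equiv a \supset b \lor c$    merging of implications
6h. $(a \supset c) \lor (b \supset c) \equiv a . b \supset c$    merging of implications


EQUIVALENCE, IMPLICATION, NEGATION, PRODUCT, SUM

7a. $(a \equiv b) \equiv (a \supset b) . (b \supset a)$    dissolution of equivalence
7b. $(a \equiv b) \equiv a . b \lor \bar{a} . \bar{b}$    dissolution of equivalence
7c. $\overline{a \equiv b} \equiv (a \equiv \bar{b})$    negation of equivalence
7d. $(a \equiv b) \equiv (\bar{a} \equiv \bar{b})$    negation of equivalent terms


ONE-SIDED IMPLICATIONS

8a. $a \supset a \lor b$    addition of an arbitrary term
8b. $a . b \supset a$    implication from “both” to “any”
8c. $a \supset (b \supset a)$    arbitrary addition of an implication
8d. $\bar{a} \supset (a \supset b)$    arbitrary addition of an implication
8e. $a . (a \supset b) \supset b$    inferential implication
8f. $(a \supset b) \supset (a \supset b \lor c)$    addition of a term in the implicate
8g. $(a \supset b) \supset (a . c \supset b)$    addition of a factor in the implicans
8h. $(a \lor c \supset b) \supset (a \supset b)$    dropping of a term in the implicans
8i. $(a \supset b . c) \supset (a \supset b)$    dropping of a factor in the implicate
8j. $(a \supset b) . (c \supset d) \supset (a . c \supset b . d)$    derivation of a merged implication
8k. $(a \supset b) . (c \supset d) \supset (a \lor c \supset b \lor d)$    derivation of a merged implication
8l. $(a \supset b) . (b \supset c) \supset (a \supset c)$    transitivity of implication
8m. $(a \equiv b) . (b \equiv c) \supset (a \equiv c)$    transitivity of equivalence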

The statements and formulas so far considered belong in the object language,
the language in which we speak about physical objects. When we speak about
symbols, however, we use another language, which is called the metalanguage.
The assertion that a given statement is true belongs in the metalanguage and
thus is one level higher than the original, or object, language. Similarly, the
terms “tautology”, “synthetic”, and so on belong in the metalanguage. The
distinction between the different levels of language is one of the important
discoveries of modern symbolic logic.²

The distinction finds a significant application in the construction of
connective operations. When we write the tautology

aVb D a (7) 

the implication of the formula, although adjunctive, supplies a “reasonable”
implication that is free from the paradoxes mentioned above; it is an
implication of logical necessity and may therefore be regarded as a connective
implication. Connective operations can be defined by the help of the
metalanguage; thus a tautological implication is a connective implication.
Similarly, a tautological equivalence like (4) supplies a connective equivalence;
it expresses the relation of having the same meaning. In fact, the definition (3)
is admissible only because (4) is a tautological equivalence. However, only
analytic connective operations can be constructed thus. Synthetic connective
operations, especially a synthetic connective implication, require further
means for their definition.

§ 5. The Method of Derivation 

The superiority of symbolic logic over the older forms of logic consists in 
the fact that logical symbolism can be used, like mathematical symbolism, 
for the derivation of other formulas from given ones. All derivations can be 
reduced to the application of two rules: 

1. Rule of substitution. For a propositional variable in a logical formula
it is permissible to substitute any other propositional variable, or a
propositional constant, or any compound expression composed of such variables or
constants.

Thus in formula (4, § 4), $c \lor d$ may be substituted for $a$, so that the formula

$$c \lor d \supset b \equiv \overline{c \lor d} \lor b \qquad (1)$$

results. This formula, like (4, § 4), is a tautology.

2. Rule of inference. If it is known or has been proved that both $a$
and $a \supset b$ are true, then $b$ may be asserted.

Thus the two statements $a$ and $a \supset b$ may be omitted and $b$ asserted
separately. The rule of inference may be symbolized by the schema

$$\begin{array}{l} a \\ a \supset b \\ \hline b \end{array} \qquad (2)$$

In traditional logic the rule of inference is called modus ponens.

² The transition from a sign to a sign for that sign is often expressed by the use of
quotation marks. In fact, many of our previous statements should be quoted. Thus I should
write, “The statement ‘$a \lor b$’ is true”. I omit the quotes for the sake of simplicity and
follow the widely accepted rule that their function is, in general, taken over by italics; thus
symbols in italics are the names of the corresponding roman symbols. There are a few
exceptions to this rule, which, however, are easily understood.

It is important to realize that the two rules cannot be expressed by the
symbols of the calculus, because they are not formulated in the calculus
but speak about the calculus. They refer to a procedure, the procedure of
derivation, that is applied to statements in the object language, and
therefore they belong in the metalanguage. If a symbolic notation of these rules
is desired, it could be constructed only in the metalanguage; in fact, schema
(2) can be regarded as a symbolism belonging in the metalanguage. A
complete symbolization, however, would include symbols for such terms as
“propositional variable”.

A rule is not a statement; it is a directive, i.e., it does not state a matter
of fact, but has a character similar to that of a command. It differs from a
command only so far as it does not order what is to be done, but grants a
permission.¹ The phrases “may be substituted” and “may be asserted”, used in
the two rules, indicate this directive character. A rule, therefore, cannot be
proved to be true or false; these concepts do not apply to directives. Instead,
a directive requires a justification. It must be shown that the directive serves
the purpose for which it is established, that it is a means to a specific end.
The procedure of derivation, to which the rules refer, is intended to supply
true statements; a justification of the rules is therefore achieved if it can be
shown that the rules will always provide true statements. The proof, which
is easily given, can be formulated in a metatheorem. Thus we can prove the
metatheorem that if the rules of substitution and inference are applied to
true statements the resulting statements are true. Unlike a rule, a
metatheorem is a statement (in the metalanguage) and thus is true or false.

That the metatheorem supplies a justification of the rule is a consequence 
of the aim of derivation. If we had the aim of deriving false formulas, the two 
rules would not be a suitable means and therefore would not be justifiable. 
Or let us consider the aim of deriving tautologies. For this aim the rule of 
inference, in the form given, is not justifiable; the rule must be restricted 
to the use of tautological premises, since only then will it supply a tautological 
conclusion. A justification can be given only with respect to a certain aim. 
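
The metatheorem and the remark on tautological premises can be illustrated by a finite check. The Python sketch below is an added illustration, not the author's proof; all function and variable names are hypothetical. It confirms by case analysis that the rule of inference never leads from true premises to a false conclusion, shows that a conclusion drawn from merely true premises need not be a tautology, and gives one instance in which tautological premises yield a tautological conclusion.

```python
from itertools import product

def implies(p, q):
    """Adjunctive implication: false only when p is true and q is false."""
    return (not p) or q

ROWS = list(product([True, False], repeat=2))

def inference_preserves_truth():
    """In every row where a and 'a implies b' are both true, b is true as well."""
    return all(b for a, b in ROWS if a and implies(a, b))

def is_tautology(formula):
    """True if the formula holds in every row of the truth table."""
    return all(formula(a, b) for a, b in ROWS)

# Premises taken as merely true statements: a, and 'a implies b'.
# The conclusion b is then true in the given case, but b is not a tautology.
conclusion = lambda a, b: b

# Premises that are tautologies: 'a or not-a', and '(a or not-a) implies (b implies b)'.
premise_1 = lambda a, b: a or not a
premise_2 = lambda a, b: implies(a or not a, implies(b, b))
tautological_conclusion = lambda a, b: implies(b, b)

print(inference_preserves_truth())
print(is_tautology(conclusion))
print(is_tautology(premise_1), is_tautology(premise_2),
      is_tautology(tautological_conclusion))
```

The check covers only the rule of inference for two variables; it is meant to make the means-end relation concrete, not to replace the general argument.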

That the rules of deductive logic require a justification has often been
overlooked. The oversight seems to be connected with the fact that the
justification is easily given. It is quite the contrary with inductive logic, in which
the problem of justification is extremely difficult (see § 91). The justification
of induction has been discussed since the time of Hume. If some philosophers
believed they could deny the existence of the problem, they were misled by
the erroneous conception that deductive logic could be established without
a justification of its rules. The recognition of all rules as directives makes it
evident that a justification of the rules is indispensable and that justifying
a rule means demonstrating a means-end relation.

¹ A permission has the imperative character of a command, though in a weakened form.
See ESL, p. 343.

For the purpose of derivation it is advisable to introduce some further,
or secondary, rules, which can be derived from the two fundamental rules.
One is the rule of replacement, which permits the replacement of a
propositional expression by another that is tautologically equivalent. Thus we can
replace the expression $a \supset b$ by the expression $\bar{a} \lor b$ and
proceed from the formula

$$(a \equiv b) \equiv (a \supset b) . (b \supset a) \qquad (3)$$

to the formula

$$(a \equiv b) \equiv (\bar{a} \lor b) . (b \supset a) \qquad (4)$$

This rule is frequently applied in derivations. A replacement differs from a
substitution in that it need not be done in all the places where the replaced
expression occurs; furthermore, it may be applied to compound expressions,
as in the example given, and is not restricted to a replacement of elementary
variables.

During the procedure of derivation, the meaning of the formulas need not 
be understood; it is sufficient to consider the formulas as combinations of 
signs with which certain operations are performed. This treatment of formulas 
is called the formal conception of the system. Only when we refer the formulas 
to physical objects must we understand them, that is, know the meaning 
of the symbols. We then apply material thinking and say that the system is 
given an interpretation, or employed in an interpreted conception. 

It should be noted that the formal conception, during a derivation, is 
restricted to the object language. The metalanguage must be used in an 
interpreted conception. Thus, in order to make an inference, one must know 
whether the given formulas satisfy the conditions expressed in the rule of 
inference. Formal manipulations with formulas require material thinking in 
the metalanguage. 

The method of derivation enables us to derive many of the formulas in
the list (pp. 21-22) from others. It has been shown that all formulas of the
calculus of propositions are derivable from a few axioms. The axioms are:

$$\begin{array}{l} a \lor a \supset a \\ a \supset a \lor b \\ a \lor b \supset b \lor a \\ (a \supset b) \supset (c \lor a \supset c \lor b) \end{array} \qquad (5)$$

The tautological character of the axioms follows by case analysis. Every
formula derivable from them is also a tautology. The implication in the
axioms is regarded as defined by (3, § 4).
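
The case analysis for the four axioms is short enough to carry out mechanically. The following Python sketch is an added illustration; the dictionary `AXIOMS` and the helper names are assumptions of the example. The implication is coded as "not-a or b", in line with the definition referred to above.

```python
from itertools import product

def implies(p, q):
    """Implication defined as 'not-p or q'."""
    return (not p) or q

# The four axioms (5), each written as a function of the variables a, b, c.
AXIOMS = {
    "a or a implies a": lambda a, b, c: implies(a or a, a),
    "a implies a or b": lambda a, b, c: implies(a, a or b),
    "a or b implies b or a": lambda a, b, c: implies(a or b, b or a),
    "(a implies b) implies (c or a implies c or b)":
        lambda a, b, c: implies(implies(a, b), implies(c or a, c or b)),
}

def is_tautology(formula):
    """Evaluate the formula on all eight rows of the truth table for a, b, c."""
    return all(formula(*row) for row in product([True, False], repeat=3))

results = {name: is_tautology(formula) for name, formula in AXIOMS.items()}
for name, ok in results.items():
    print(name, ok)
```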

§ 6. The Calculus of Functions 

In the calculus of propositions, the propositions are regarded as wholes and 
therefore represent undivided units for all operations. The part of logic that 
analyzes the inner structure of propositions and employs it for operations is 
called the calculus of functions. 

Consider the proposition “Aristotle was a Greek”. It is symbolized in the
calculus of propositions by the single letter $a$. But in the calculus of functions
we refer to the fact that the proposition has an inner structure; and we
separate the subject Aristotle, about whom something is said, from what is
said about him, namely, his property of being a Greek. As the symbol of this
relation we employ that of the mathematical function and symbolize the
proposition in the form

$$f(x) \qquad (1)$$

The argument sign $x$ corresponds to the subject about which we speak;
the function sign $f$ corresponds to the property holding for it. The symbol
(1) indicates the inner structure, or the form, of the proposition.

Since $f$ and $x$ are variables, the expression (1) can be given various meanings
by suitable specialization of these variables. If only $f$ is specialized, say, as
meaning the property of being a Greek, the expression (1) is not yet a
proposition; it can be true or false, depending on what we substitute for the variable
$x$. Thus (1) is true if we put “Aristotle” for $x$, and false if we put “Goethe”
for $x$. The sign $f$ is called a propositional function; the combination $f(x)$ is
called a functional. If we wish to indicate the variable in a propositional
function, we write $f(\hat{x})$; the circumflex distinguishes this expression from the
functional $f(x)$. The expression $f(\hat{x})$, therefore, means the same as $f$. In the
same meaning as the term propositional function, the word predicate is often
used. The object correlate of the propositional function, the property denoted
by it, is called a situational function. For instance, the property, or situational
function, of redness is denoted by the predicate, or propositional function,
“red”.
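
A loose programming analogy may help to keep the two notions apart. In the Python sketch below, which is an added illustration with hypothetical names and data, the function object `is_greek` plays the part of the propositional function, whereas a call such as `is_greek("Aristotle")` plays the part of the functional after a constant has been put for the variable.

```python
# Hypothetical data: the persons counted as Greeks in this illustration.
GREEKS = {"Aristotle", "Plato"}

def is_greek(x):
    """Propositional function: true or false depending on the argument x."""
    return x in GREEKS

# Putting a constant for x turns the functional into a proposition
# that has a definite truth value.
aristotle_is_greek = is_greek("Aristotle")
goethe_is_greek = is_greek("Goethe")

print(aristotle_is_greek, goethe_is_greek)
```

The analogy is imperfect in one respect: a Python function accepts arguments outside its intended range, whereas the text treats such substitutions as meaningless.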

There are two ways of constructing a proposition from a propositional
function, or from a functional. The first has been mentioned. We substitute
a constant $x_1$ for the variable $x$, that is, we go from $f(x)$ to $f(x_1)$. This is the
method of specialization. It is not permissible, however, to substitute for $x$
a value that makes $f(x)$ meaningless. Thus if $f$ means being a Greek, and
we put for $x$ the number 4, the resulting expression, “The number 4 is a
Greek”, is meaningless. Besides the method of specialization there is a second
procedure for constructing a proposition from a propositional function: the
method of binding the variables,¹ which has two subdivisions.

The first form of binding the variables is the method of generalization,
which employs the all-operator, written $(x)$. We thus have

$$(x)f(x) \qquad (2)$$

We read this expression in the form, “For all $x$, $f(x)$”. It is a statement,
not a propositional function, since it is either true or false. If, for instance,
$f$ means the property of being a Greek, (2) is false; if $f$ is to mean the
property of being a part of nature, (2) is true. Further examples of true
all-statements obtain if we put a more complicated expression for $f(x)$, for instance,
an implication. Thus we arrive at an expression of the form

$$(x)[f(x) \supset g(x)] \qquad (3)$$

If we understand by $f$ the property of being a man, by $g$ the property of being
mortal, (3) states that all men are mortal, and thus is true. In fact, we may
substitute for $x$ whatever we like, for instance, Mount Everest, since it is
correct to say, “If Mount Everest is a man it is mortal”.

The second form of binding the variables is constructed by the help of
the existential operator, written $(\exists x)$. From $f(x)$ we thus construct the
statement

$$(\exists x)f(x) \qquad (4)$$

which is read, “There is an $x$ such that $f(x)$”. Like (2), the expression (4) is
a statement, because it is either true or false. If we understand by $f$ the
property of being a Greek, (4) is true; if we interpret $f$ as meaning the property
of being an inhabitant of the moon, (4) is false.
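
For a small finite range the two operators can be evaluated directly. The Python sketch below is an added illustration under stated assumptions: the range `DOMAIN` and the sets `MEN`, `MORTAL`, and `GREEKS` are invented for the example, and the helper names are hypothetical. It evaluates statements of the forms (2), (3), and (4), including the case of Mount Everest, for which the implication holds because its implicans is false.

```python
DOMAIN = ["Aristotle", "Goethe", "Mount Everest"]  # hypothetical finite range
MEN = {"Aristotle", "Goethe"}
MORTAL = {"Aristotle", "Goethe"}
GREEKS = {"Aristotle"}

def implies(p, q):
    """Adjunctive implication: false only when p is true and q is false."""
    return (not p) or q

def for_all(f):
    """All-operator over the finite range DOMAIN."""
    return all(f(x) for x in DOMAIN)

def exists(f):
    """Existential operator over the finite range DOMAIN."""
    return any(f(x) for x in DOMAIN)

# Form (2) with f = "is a Greek"
all_greek = for_all(lambda x: x in GREEKS)
# Form (3): for all x, if x is a man then x is mortal
all_men_mortal = for_all(lambda x: implies(x in MEN, x in MORTAL))
# Form (4) with f = "is a Greek"
some_greek = exists(lambda x: x in GREEKS)
# Form (4) with a property that nothing in DOMAIN has
some_moon_dweller = exists(lambda x: False)

print(all_greek, all_men_mortal, some_greek, some_moon_dweller)
```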

The binding of variables thus supplies a statement of a peculiar inner
structure. The statement contains a variable $x$, but as a whole it does not
depend on the variable. Such a variable is therefore called a bound variable
in contradistinction to a free variable. To illustrate the nature of a bound
variable, we may refer to the variable of a definite integral, which is confined to
the inside of the expression, whereas the whole expression does not depend
on it.

The expression (3) is called a general implication. Although the implication
sign occurring in it is of an adjunctive nature, the general implication is free,
to a certain extent, from the paradoxes of the individual implication. The
condition that the adjunctive implication hold for all $x$ excludes certain
paradoxical applications. Thus the foregoing example concerning the earth
and its revolution around the sun is ruled out when we require for a reasonable
implication that it be valid for all things. The statement, “For all $x$, if $x$ is
a flat disk it revolves around the sun”, is not true, even if we interpret the
implication as adjunctive. A reasonable individual implication includes a
reference to generality; the meaning of implication in conversational language
is constructed by a transfer of meaning from the general to the particular case.
Thus the implication, “If you heat this piece of iron it will expand”, is a
reasonable individual implication because it is a special case of a valid general
implication.

¹ This method is frequently called “quantification”; the operators then are called
“quantifiers”.

The general implication is therefore an instrument for the definition of a
synthetic connective implication, also called physical implication, which differs
from the tautological, or analytic connective, implication (7, § 4) in that it
expresses a physical, not a logical, relation. The complete definition of this
kind of implication, however, requires some further means; in particular, the
case of an always-false implicans and of an always-true implicate must be
excluded, since otherwise the paradoxes of the individual adjunctive
implication reappear. For this definition, which is given in the frame of a general
theory of connective operations, the reader may consult another publication
by the author.²

The all-operator can be regarded as a generalization of the “and”, the 
existential operator as a generalization of the “or”. For a finite range of 
the variables we thus have the tautological equivalences 

$$(x)f(x) \equiv f(x_1) . f(x_2) \ldots f(x_n) \qquad (5)$$

$$(\exists x)f(x) \equiv f(x_1) \lor f(x_2) \lor \ldots \lor f(x_n) \qquad (6)$$

These relations, however, cannot be used in defining the operators, since the 
operations “and” and “or” are not defined for an infinite number of terms. 
The operators have, therefore, an independent meaning and can be regarded 
as generalizations of the operations “and” and “or” for an infinite number 
of terms. 
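
For a finite range the two equivalences can be checked exhaustively. In the Python sketch below, an added illustration with hypothetical helper names, the repeated two-place "and" and "or" are built with `functools.reduce` and compared with the built-in `all` and `any`, which here stand in for the all-operator and the existential operator. The range size `N = 4` is an arbitrary assumption.

```python
from functools import reduce
from itertools import product
from operator import and_, or_

def conjunction(values):
    """Repeated two-place 'and' applied to the truth values f(x1), ..., f(xn)."""
    return reduce(and_, values)

def disjunction(values):
    """Repeated two-place 'or' applied to the truth values f(x1), ..., f(xn)."""
    return reduce(or_, values)

# Every assignment of truth values to f(x1), ..., f(xn) for a range of N objects.
N = 4
agree = all(
    all(values) == conjunction(values) and any(values) == disjunction(values)
    for values in product([True, False], repeat=N)
)

print(agree)
```

The check says nothing about an infinite range, where `reduce` would never terminate; this mirrors the remark that the operators cannot be defined by the two-place operations.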

So far, propositional functions of one variable, or one-place functions,
have been considered. There are also propositional functions of several
variables, or many-place functions. Their object correlates, i.e., the situational
functions denoted by them, are often called relations, in contradistinction
to properties, i.e., situational functions of one variable. Many-place
functions are written in the form $f(\hat{x},\hat{y})$, $f(\hat{x},\hat{y},\hat{z})$, and so on. The circumflex
notation is convenient for such functions because the symbol $f$ would not indicate
the number of variables.

Many-place functions frequently play a part in the language of everyday
life. Thus the proposition “Peter is the brother of Paul” contains the
propositional function “x is the brother of y”, which is symbolized by $f(\hat{x},\hat{y})$; the
sentence itself is symbolized by the functional $f(x,y)$. “To be brother”
therefore denotes a binary relation. A propositional function of three variables
$f(\hat{x},\hat{y},\hat{z})$ occurs in the example, “The mother gives an apple to the child”,
where the variables $x,y,z$ are represented by the expressions “the mother”,
“an apple”, and “the child”, and the term “gives” is symbolized by the
function sign $f$. Another example of a ternary relation is the relation denoted
by the word “between”; “y is situated between x and z” contains a
propositional function $f(\hat{x},\hat{y},\hat{z})$.

² ESL, chap. viii.

The two procedures for constructing propositions can be applied also to
propositional functions of several variables; the individual variables can be
specialized or bound separately. Thus $f(x_1,y)$ is a propositional function
containing one variable $y$. The two variables can also be bound in a different
way. From $f(x,y)$, for instance, we can construct the proposition

$$(x)(\exists y)f(x,y) \qquad (7)$$

Another binding of the variables is given by

$$(\exists x)(y)f(x,y) \qquad (8)$$

A third is

$$(\exists y)(x)f(x,y) \qquad (9)$$

Statements (7) and (9) are not identical because the order of the operators
is relevant. This fact may be demonstrated by two examples in which we
replace $f(x,y)$ by the more complicated expression

$$g(x) \supset g(y) . h(x,y) \qquad (10)$$

If we understand by $g$ “to be a natural number”, by $h$ “to be smaller”, then
(7) holds, since it then means, “For every natural number $x$ there is a natural
number $y$ such that $x$ is smaller than $y$”. Statement (9) means, in this
interpretation, “There is one number $y$ such that all $x$ are smaller than $y$”. The
statement, however, is false, since there is no greatest number. We obtain
an example satisfying both formulas if we understand by $h$ the relation
“to be not smaller”. Statement (9) then says, “There is a natural number $y$
such that all natural numbers are not smaller than $y$”. Such a number $y$
exists: it is the number zero. Statement (7) is then a fortiori satisfied. It is
obvious that (9) is the stronger statement; (9) implies (7), but not vice versa
(16a). Whereas (7) represents the meaning of “every”, (9) formulates the
meaning of “all”. The distinction between “all” and “every” is therefore
expressed in terms of the order of operators. Two operators of the same kind,
however, are commutative (15a, 15b).
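
The dependence on the order of operators can be observed on a small finite range. The Python sketch below is an added illustration; the range `DOMAIN`, the helper names, and the relation `same` are assumptions of the example. Because the natural numbers cannot be enumerated, the relation "smaller" of the text is swapped for "x equals y", which likewise satisfies the form of (7) but not that of (9) on a finite range. The last part checks, for every two-place function on the range, that a statement of form (9) implies the corresponding statement of form (7).

```python
from itertools import product

DOMAIN = range(3)  # hypothetical finite range of the variables

def all_exists(f):
    """For all x there is a y such that f(x, y): the form of statement (7)."""
    return all(any(f(x, y) for y in DOMAIN) for x in DOMAIN)

def exists_all(f):
    """There is a y such that for all x, f(x, y): the form of statement (9)."""
    return any(all(f(x, y) for x in DOMAIN) for y in DOMAIN)

# "x is not smaller than y": both forms hold, the witness for (9) being y = 0.
not_smaller = lambda x, y: x >= y
# "x equals y": every x has a partner y, but no single y serves all x.
same = lambda x, y: x == y

# Exhaustive check that form (9) implies form (7) for every two-place function.
pairs = list(product(DOMAIN, repeat=2))
nine_implies_seven = True
for values in product([True, False], repeat=len(pairs)):
    table = dict(zip(pairs, values))
    f = lambda x, y, t=table: t[(x, y)]
    if exists_all(f) and not all_exists(f):
        nine_implies_seven = False

print(all_exists(not_smaller), exists_all(not_smaller))
print(all_exists(same), exists_all(same))
print(nine_implies_seven)
```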

An important connection between all-statements and existential statements
is expressed by the formulas

$$(x)f(x) \equiv \overline{(\exists x)\overline{f(x)}} \qquad (11)$$

$$(\exists x)f(x) \equiv \overline{(x)\overline{f(x)}} \qquad (12)$$

They exhibit an important feature of the notation of the calculus of
functions. The binding of variables is symbolized by a prefixed operator in order
to express the two forms of negation resulting from the possibilities of negating
the whole expression and the function alone. The expression

$$\overline{(x)f(x)} \qquad (13)$$

therefore must be distinguished from the expression

$$(x)\overline{f(x)} \qquad (14)$$

The two statements are distinguished by the scope of the negation, which
in (13) includes the operator, whereas it does not in (14). The expression
(14) is the stronger statement, that is, (14) implies (13), but not vice versa
(13c).
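
The relations between negation and the operators can be checked on a finite range. The Python sketch below is an added illustration with hypothetical names; it reads (11) and (12) as the usual duality laws between the two operators, and it takes the range to be nonempty, which the implication from (14) to (13) requires. Every one-place function on the range is tried in turn.

```python
from itertools import product

DOMAIN = range(3)  # hypothetical nonempty finite range

def implies(p, q):
    """Adjunctive implication: false only when p is true and q is false."""
    return (not p) or q

def forall(g):
    return all(g(x) for x in DOMAIN)

def exists(g):
    return any(g(x) for x in DOMAIN)

def check_all_functions(law):
    """Evaluate `law` for every one-place function f on DOMAIN."""
    for values in product([True, False], repeat=len(DOMAIN)):
        f = lambda x, v=values: v[x]
        if not law(f):
            return False
    return True

# (11): "for all x, f(x)" is equivalent to "not: there is an x with not-f(x)"
law_11 = lambda f: forall(f) == (not exists(lambda x: not f(x)))
# (12): "there is an x with f(x)" is equivalent to "not: for all x, not-f(x)"
law_12 = lambda f: exists(f) == (not forall(lambda x: not f(x)))
# (14) implies (13): "for all x, not-f(x)" implies "not: for all x, f(x)"
law_13c = lambda f: implies(forall(lambda x: not f(x)), not forall(f))
# The converse, (13) implies (14), fails for some functions.
converse = lambda f: implies(not forall(f), forall(lambda x: not f(x)))

print(check_all_functions(law_11), check_all_functions(law_12),
      check_all_functions(law_13c), check_all_functions(converse))
```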

The expression $f(x)$ in (2) and (4) is called the operand of the statement;
the whole expression, including the operator and the operand, is called the
scope of the operator. Not always does the scope of an operator extend over
the whole formula. Thus, in the expression

$$(x)f(x) \supset (\exists y)g(y) \qquad (15)$$

the scope of the all-operator is the implicans, and that of the existential
operator is the implicate. Whereas (2) and (4) are one-scope formulas, (15)
is not.

In (7) the operand of the existential operator is the expression $f(x,y)$,
whereas the operand of the all-operator is given by the expression $(\exists y)f(x,y)$.
This distinction explains the difference in the meanings of (7) and (9). In
spite of this distinction, formulas like (7) and (9), in which the operators are
assembled at the beginning and extend over the rest of the formula, are
regarded as one-scope formulas.

For reasons of expediency, a combination of an operator with a proposition
will be regarded as meaningful, and as meaning the same as the proposition.
Thus we have

$$(x)a \equiv a \qquad (16)$$

$$(\exists x)a \equiv a \qquad (17)$$

Then $a$ is called a constant. These relations are useful, in particular, when
$a$ is a functional expression that does not contain the variable $x$.

It is often expedient to express generality, not by means of an operator,
but through a free variable. For this purpose the convention is introduced
that if an expression containing a free variable is asserted, it is meant to hold
for all values of the free variable. Thus the expression

$$f(x) \lor \overline{f(x)} \qquad (18)$$

is true for all values of $x$ and can be asserted; the variable $x$ is then a free
variable. The free variable, therefore, has the same meaning as an all-operator
the scope of which is the whole formula. So it is permissible, in formulas
containing free variables, to put an all-operator referring to these variables
before the whole formula as its scope; this is called the rule for free variables.

Free variables have been used in the calculus of propositions, since the
letters $a$, $b$, and so on, occurring in tautologies, express free variables.
Similarly, the letter $f$ in (18) expresses a free variable. The use of free variables
is possible when the scope of the generalization is not restricted to parts of
the formula. Mathematical notation makes frequent use of free variables.
In a mathematical identity like

$$(x + y)^2 = x^2 + 2xy + y^2 \qquad (19)$$

the letters $x$ and $y$ represent free variables the range of which is the domain
of numbers. In conversational language, a free variable is expressed by the
word “any”. The means of expression, however, are limited for free variables;
the difference between the expressions (13) and (14) is not expressible in
terms of free variables.

The concept of tautology can be defined for functions, too. A tautology 
is a formula that is true for all values of the argument and of the functional 
variables. The derivation of tautologies, however, is more complicated, since 
a method of case analysis for functions cannot be carried through generally, 
although it can be applied, in a generalized form, to certain kinds of formulas. 
A table of tautologies in functions is presented below. 

The axiomatization of the calculus of functions has shown that, besides
the axioms (5, § 5) of the calculus of propositions, it is sufficient to employ
formulas (14a) and (14b) as axioms. All tautologies in functions are then
derivable.

The functional calculus so far developed is called the simple calculus of
functions. It is distinguished from the higher calculus of functions, in which
functional variables are bound by operators or occur as arguments of
functions of a higher type. The higher calculus will not be presented in this book
because it is not required for the exposition of the theory of probability.

Tautologies in the Calculus of Functions 

FORMULAS CONCERNING FUSION OR DIVISION OF OPERANDS

9a. $(x)[f(x) . g(x)] \equiv (x)f(x) . (x)g(x)$
9b. $(x)f(x) \lor (x)g(x) \supset (x)[f(x) \lor g(x)]$
9c. $(x)[f(x) \lor g(x)] \supset (x)f(x) \lor (\exists x)g(x)$
9d. $(x)[f(x) \supset g(x)] \supset [(x)f(x) \supset (x)g(x)]$
9e. $(x)[f(x) \supset g(x)] \supset [(\exists x)f(x) \supset (\exists x)g(x)]$
9f. $(x)[f(x) \equiv g(x)] \supset [(x)f(x) \equiv (x)g(x)]$
9g. $(x)[f(x) \equiv g(x)] \supset [(\exists x)f(x) \equiv (\exists x)g(x)]$
9h. $(x)f(x) . (x)[f(x) \supset g(x)] \supset (x)g(x)$

10a. $(\exists x)[f(x) . g(x)] \supset (\exists x)f(x) . (\exists x)g(x)$
10b. $(\exists x)[f(x) \lor g(x)] \equiv (\exists x)f(x) \lor (\exists x)g(x)$
10c. $(\exists x)[f(x) \supset g(x)] \equiv (x)f(x) \supset (\exists x)g(x)$
10d. $[(\exists x)f(x) \supset (\exists x)g(x)] \supset (\exists x)[f(x) \supset g(x)]$
10e. $[(\exists x)f(x) \supset (x)g(x)] \supset (x)[f(x) \supset g(x)]$
10f. $(\exists x)f(x) . (x)g(x) \supset (\exists x)[f(x) . g(x)]$

11a. $(x)[a . f(x)] \equiv a . (x)f(x)$
11b. $(x)[a \lor f(x)] \equiv a \lor (x)f(x)$
11c. $(x)[a \supset f(x)] \equiv a \supset (x)f(x)$
11d. $(x)[f(x) \supset a] \equiv (\exists x)f(x) \supset a$
11e. $(x)[f(x) \equiv a] \supset [(x)f(x) \equiv a]$
11f. $[(x)a] \equiv a$

12a. $(\exists x)[a . f(x)] \equiv a . (\exists x)f(x)$
12b. $(\exists x)[a \lor f(x)] \equiv a \lor (\exists x)f(x)$
12c. $(\exists x)[a \supset f(x)] \equiv a \supset (\exists x)f(x)$
12d. $(\exists x)[f(x) \supset a] \equiv (x)f(x) \supset a$
12e. $[(\exists x)f(x) \equiv a] \supset (\exists x)[f(x) \equiv a]$
12f. $[(\exists x)a] \equiv a$

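FORMULAS CONCERNING NEGATION OF OPERATORS

13a. $(x)f(x) \equiv \overline{(\exists x)\overline{f(x)}}$
13b. $(\exists x)f(x) \equiv \overline{(x)\overline{f(x)}}$
13c. $(x)\overline{f(x)} \supset \overline{(x)f(x)}$
13d. $\overline{(\exists x)f(x)} \supset (\exists x)\overline{f(x)}$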

FORMULAS OF SUBALTERNATION

14a. $(y)f(y) \supset f(x)$
14b. $f(x) \supset (\exists y)f(y)$
14c. $(x)f(x) \supset (\exists x)f(x)$

FORMULAS CONCERNING THE ORDER OF OPERATORS

15a. $(x)(y)f(x,y) \equiv (y)(x)f(x,y)$
15b. $(\exists x)(\exists y)f(x,y) \equiv (\exists y)(\exists x)f(x,y)$
16a. $(\exists x)(y)f(x,y) \supset (y)(\exists x)f(x,y)$
16b. $(\exists x)(y)[f(x) . g(y)] \equiv (y)(\exists x)[f(x) . g(y)]$
16c. $(\exists x)(y)[f(x) \lor g(y)] \equiv (y)(\exists x)[f(x) \lor g(y)]$
16d. $(\exists x)(y)[f(x) \supset g(y)] \equiv (y)(\exists x)[f(x) \supset g(y)]$
16e. $(x)(y)[f(x,y) \lor g(x,y)] \supset (\exists x)(y)f(x,y) \lor (x)(\exists y)g(x,y)$
17a. $(x)(y)f(x,y) \supset (x)f(x,x)$
17b. $(\exists x)(y)f(x,y) \supset (\exists x)f(x,x)$
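
Some formulas of the table can be spot-checked on a small finite range. The Python sketch below is an added illustration with hypothetical names; it is a plausibility check on a range of two objects, not a proof, since a tautology in functions must hold for every range. It tries every pair of one-place functions for 9b, 10a, and 14c, and shows that the converse of 9b, which is not in the table, fails.

```python
from itertools import product

DOMAIN = range(2)  # hypothetical nonempty range with two objects

def implies(p, q):
    """Adjunctive implication: false only when p is true and q is false."""
    return (not p) or q

def forall(g):
    return all(g(x) for x in DOMAIN)

def exists(g):
    return any(g(x) for x in DOMAIN)

def one_place_functions():
    """Yield every assignment of truth values to f(x) for x in DOMAIN."""
    for values in product([True, False], repeat=len(DOMAIN)):
        yield lambda x, v=values: v[x]

def holds_for_all(law):
    """True if law(f, g) holds for every pair of one-place functions."""
    return all(law(f, g) for f in one_place_functions()
               for g in one_place_functions())

# 9b: (all f) or (all g) implies all (f or g)
law_9b = lambda f, g: implies(forall(f) or forall(g),
                              forall(lambda x: f(x) or g(x)))
# Converse of 9b, which does not appear in the table
converse_9b = lambda f, g: implies(forall(lambda x: f(x) or g(x)),
                                   forall(f) or forall(g))
# 10a: some (f and g) implies (some f) and (some g)
law_10a = lambda f, g: implies(exists(lambda x: f(x) and g(x)),
                               exists(f) and exists(g))
# 14c: all f implies some f; this relies on DOMAIN being nonempty
law_14c = lambda f, g: implies(forall(f), exists(f))

print(holds_for_all(law_9b), holds_for_all(converse_9b),
      holds_for_all(law_10a), holds_for_all(law_14c))
```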

§ 7. The Calculus of Classes 

The calculus of functions leads to the introduction of a new and important
concept: the concept of class. If a propositional function $f(x)$ is given, all
arguments $x$ that satisfy $f(x)$ can be incorporated in one class, the class $F$.
Every propositional function thus defines a class. Vice versa, every class can
be regarded as defined by a propositional function. The arguments $x$ for
which $f(x)$ is true are called members, or elements, of the class $F$. For the
expression “x is a member of F”, we write

$$x \in F \qquad (1)$$

The symbol $\in$ corresponds to the copula of conversational language, for
instance, to “is a” in “Saddle Peak is a mountain”.

Because of the equivalence of propositional functions and classes it is
not necessary to consider the concept of class as a primitive concept. We
can rather conceive (1) as defined by $f(x)$:

$$x \in F =_{Df} f(x) \qquad (2)$$

It should be noted that we do not thus define the concept of class, but the
total expression, “x is a member of the class”. This is not a deficiency of the
theory, since the concept of class is never used independently. All statements
referring to classes are translatable into statements in which the total
combination (1) occurs. It suffices, therefore, that the combination has a meaning.
The definition (2) is called a definition in use of the concept of class. The term
“set” is used in the same sense as the term “class”. The class $F$ is also called
the extension of the propositional function $f(\hat{x})$.
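
The passage from a propositional function to its class can be pictured with a set comprehension. In the Python sketch below, an added illustration, the range `RANGE` and the function `is_prime` are assumptions chosen for the example. The class is built as the extension of the function, and the check confirms that "x is a member of F" has the same truth value as the functional for every argument in the range.

```python
RANGE = range(1, 21)  # hypothetical range of arguments

def is_prime(x):
    """Propositional function: 'x is a prime number'."""
    return x > 1 and all(x % d for d in range(2, int(x ** 0.5) + 1))

# The class F is the extension of the function: all arguments that satisfy it.
F = {x for x in RANGE if is_prime(x)}

# Definition in use (2): 'x is a member of F' means the same as f(x).
definition_in_use_holds = all((x in F) == is_prime(x) for x in RANGE)

print(sorted(F), definition_in_use_holds)
```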

Similarly, we can go from many-place functions to the corresponding class.
The extension of a function $f(x,y)$ is given by the class of the couples $x,y$
that satisfy the propositional function. In analogy with (2) we write

$$x,y \in F =_{Df} f(x,y) \qquad (3)$$

From functions of three variables we can go in a similar way to the class
of triplets. The procedure can be extended to functions of more variables.

The calculus of classes is concerned with operations that can be performed 
with classes. We obtain these operations by performing operations with the 
corresponding propositional functions and then transferring the results to 
classes. 

To begin with propositional functions of one argument: since the range of
the arguments that make $f(x)$ meaningful is a definite one, the class of the
arguments $x$ that make $f(x)$ false is as well determined as the class of the
arguments that make $f(x)$ true. The class of the arguments that make $f(x)$
false is called complementary class, or complement of $F$, denoted by $\bar{F}$. In
symbolic notation the definition reads

$$x \in \bar{F} =_{Df} \overline{x \in F} \qquad (4)$$

Thus the class of the nonprime numbers is a well-determined class.

The introduction of the complementary class can be conceived as an
operation by which a certain class is constructed from a given class. This
operation is analogous to the operation of negation applied to propositions.
Correspondingly, we can define operations between classes that are analogous
to the other operations of the calculus of propositions. We then construct a
new class $H$ from the two given classes $F$ and $G$ by the corresponding
operation. This procedure can be carried through in two different ways: the new
class can be constructed as a class of members $x$, so that it is of the same
kind as the original classes; or the couples $x,y$, which are constructed from
members of the two classes $F$ and $G$, can be used as members of the new class.

Beginning with the first procedure, we define

$$x \in F \lor G =_{Df} (x \in F) \lor (x \in G) \qquad \text{joint class, or disjunct} \qquad (5)$$

$$x \in F . G =_{Df} (x \in F) . (x \in G) \qquad \text{common class, or conjunct} \qquad (6)$$

The classes so constructed are illustrated in figure 1. The joint class is
produced by joining the two classes $F$ and $G$; their common members, however,
are counted only once. For instance, if two societies $F$ and $G$ sponsor a
congress, persons who are members of either of the societies are entitled to take
part, no matter whether the same person belongs to both societies. The class
of the persons entitled to take part in the congress is therefore the joint class
of the two societies. The common class is composed of persons who belong
to both societies.
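
The class operations can be written down directly from their definitions. The Python sketch below is an added illustration; the range of persons and the membership lists of the two societies are invented for the example. Each class is built from the corresponding propositional operation, following definitions (4), (5), and (6).

```python
RANGE = {"Ann", "Ben", "Cara", "Dev", "Eli"}  # hypothetical range of persons
F = {"Ann", "Ben", "Cara"}                     # members of society F
G = {"Cara", "Dev"}                            # members of society G

# Definition (5): x belongs to the joint class if x is in F or x is in G.
joint = {x for x in RANGE if (x in F) or (x in G)}
# Definition (6): x belongs to the common class if x is in F and x is in G.
common = {x for x in RANGE if (x in F) and (x in G)}
# Definition (4): x belongs to the complement of F if it is not the case that x is in F.
complement_F = {x for x in RANGE if not (x in F)}

print(sorted(joint), sorted(common), sorted(complement_F))
```

Because the result is a set, a person who belongs to both societies appears only once in the joint class, in agreement with the remark on common members.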

The second procedure mentioned provides classes the elements of which
are the couples $x,y$. Two definitions, analogous to the preceding definitions,
can be used:

$$x,y \in F_{\hat{x}} \lor G_{\hat{y}} =_{Df} (x \in F) \lor (y \in G) \qquad \text{couple disjunct} \qquad (7)$$

$$x,y \in F_{\hat{x}} . G_{\hat{y}} =_{Df} (x \in F) . (y \in G) \qquad \text{couple conjunct, or combination class} \qquad (8)$$



Fig. 1. Joint class and common class: joint class, according to (5), of classes F
and G is whole shaded area; common class, according to (6), is double-shaded area.

The circumflex over the subscripts has a similar meaning as in the expression
$f(\hat{x})$; it effects a sort of binding of the variable, and substitutions for the
other letters x and y can be made without a change in the subscripts. The
difference between these definitions and the two preceding ones becomes
clear when the corresponding propositional functions are considered. The
definitions (5) and (6) are equivalent to the transition from the two propositional
functions f(x) and g(x) to the new propositional function of one variable:

$$f(x) \vee g(x) \tag{9}$$

$$f(x) . g(x) \tag{10}$$

The definitions (7) and (8), however, are equivalent to the transition from 
the same propositional functions to the new propositional functions of two 
variables: 

$$f(x) \vee g(y) \tag{11}$$

$$f(x) . g(y) \tag{12}$$

To the function (11) corresponds the class defined in (7); to the function (12), 
the class defined in (8). An example of a couple conjunct, or combination 
class, is the class of possible telephone connections between subscribers in 
two cities. The corresponding couple disjunct is represented by the class of 
telephone connections that any of the parties in either city can have with 
any party in the entire country, that is, the class of telephone connections 
for which at least one party is a subscriber in one of the two cities.
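
The couple classes (7) and (8) can be sketched over a finite domain. The subscriber labels below are invented for illustration, and the whole domain stands in for "the entire country"; this is an assumption of the sketch, not part of the text.

```python
from itertools import product

# Hypothetical domain of subscribers; F and G are the subscribers of two cities.
domain = {"a1", "a2", "b1", "c1"}
F = {"a1", "a2"}
G = {"b1"}

# Couple disjunct, definition (7): couples (x, y) with x in F or y in G.
couple_disjunct = {(x, y) for x, y in product(domain, repeat=2) if x in F or y in G}

# Couple conjunct (combination class), definition (8): x in F and y in G.
couple_conjunct = {(x, y) for x, y in product(domain, repeat=2) if x in F and y in G}

print(sorted(couple_conjunct))
print(len(couple_disjunct))
```

The couple conjunct coincides with the Cartesian product of F and G, whereas the couple disjunct also admits couples in which only one member satisfies its condition.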

A third form of operations will now be considered. The formation of classes 
defined in (7) and (8) can be specialized by restricting membership to certain 
couples x,y that are selected by a coupling relation e(x,y) that establishes a 
one-one correspondence. The one-one correspondence may be symbolized by 
adding a subscript i to the variables, such that $x_i, y_i$ denotes a couple of
corresponding members. Thus the following definitions, analogous to (7) and
(8), obtain:

$$x_i, y_i \in F_{\hat{x}_i} \vee G_{\hat{y}_i} \;=_{Df}\; (x_i \in F) \vee (y_i \in G) \qquad \text{narrower couple disjunct} \tag{13}$$

$$x_i, y_i \in F_{\hat{x}_i} . G_{\hat{y}_i} \;=_{Df}\; (x_i \in F) . (y_i \in G) \qquad \text{narrower couple conjunct} \tag{14}$$

The construction of the classes (13) and (14) is equivalent to the transition
from the propositional functions f(x) and g(y) to the new propositional function
of two variables, defined in terms of the coupling relation e(x,y):

$$f(x_i) \vee g(y_i) \tag{15}$$

$$f(x_i) . g(y_i) \tag{16}$$

To eliminate the subscript i, the coupling relation e(x,y) can be substituted;
then the functions result:

$$e(x,y) . [f(x) \vee g(y)] \tag{17}$$

$$e(x,y) . f(x) . g(y) \tag{18}$$


In application to probability relations, the notation by means of the sub¬ 
script will be used. 

An example of a narrower couple disjunct is the class of married couples
one part of which belongs to family F or family G. The class of married
couples of which one part is white and the other colored, the class of marriages
between whites and Negroes, is an example of a couple conjunct in the
narrower sense.

Fig. 2. Implication class of F and G: according to (19), all points of plane,
except shaded area, belong to implication class.

Fig. 3. Relation of class inclusion: F is included in G, according to (20a),
(20b), and (21).

The classes (5) and (6) can be conceived as special cases of the classes 
(13) and (14), resulting when the coupling relation e(x,y) is the identity. 
Note that the couple disjunct and the couple conjunct can be defined even 
for a single class F. 
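
The narrower couple classes, and their reduction to the joint and common class when the coupling relation is the identity, can be checked on a small case. The sketch below assumes an invented one-one coupling stored as a dictionary; all labels are hypothetical.

```python
# Hypothetical one-one coupling e(x, y): each x is paired with exactly one y.
coupling = {"h1": "w1", "h2": "w2", "h3": "w3"}
F = {"h1", "w2", "w3"}   # invented class F
G = {"w1", "w3"}         # invented class G

def narrower_disjunct(e, F, G):
    # Definition (13): coupled pairs with x_i in F or y_i in G.
    return {(x, y) for x, y in e.items() if x in F or y in G}

def narrower_conjunct(e, F, G):
    # Definition (14): coupled pairs with x_i in F and y_i in G.
    return {(x, y) for x, y in e.items() if x in F and y in G}

print(sorted(narrower_disjunct(coupling, F, G)))
print(sorted(narrower_conjunct(coupling, F, G)))

# Special case: the coupling relation is the identity on the members of F and G.
identity = {x: x for x in F | G}
joint = {x for x, _ in narrower_disjunct(identity, F, G)}
common = {x for x, _ in narrower_conjunct(identity, F, G)}
print(joint == F | G, common == F & G)
```

With the identity coupling every couple has the form x,x, so that the couples can be replaced by their single members and the classes (5) and (6) result.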

For operations with classes, we use the same symbols as for operations with 
propositions. This simplification is possible because of the isomorphism be¬ 
tween the calculus of classes and the calculus of propositions. The use of 
capital letters indicates that the operations standing between them are class 
operations and thus constitute classes, not propositions. The method can be 
extended to include further propositional operations; and, moreover, the 
transformations holding for propositions can be transferred to classes. Thus 
the implication class can be defined by the formula 

$$x \in F \supset G \;=_{Df}\; x \in \bar{F} \vee G \tag{19}$$

This means that the implication class $F \supset G$ is the joint class of $\bar{F}$ and G.
Assume that in figure 2 the domain of all meaningful arguments is represented
by the points of the whole plane of the figure; then the implication
class (19) is represented by all points of the plane except the shaded area.
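
A finite stand-in for the plane of figure 2 makes the construction concrete. The sketch assumes a small invented domain of integers as the domain of all meaningful arguments.

```python
# Hypothetical finite domain standing in for the whole plane of figure 2.
universe = set(range(1, 11))
F = {1, 2, 3, 4}
G = {3, 4, 5, 6}

complement_F = universe - F
implication_class = complement_F | G   # definition (19): not-F joined with G

# The excluded (shaded) region consists of the members of F that are not in G.
excluded = F - G

print(sorted(implication_class))
print(implication_class == universe - excluded)
```

The only arguments outside the implication class are those for which x is in F but not in G, in correspondence with the one case in which a propositional implication is false.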

A particularly important case results if the implication class is identical 
with the entire domain of all meaningful arguments. We then have 

$$(x)\,(x \in F \supset G) \tag{20a}$$

or, in another notation,

$$(x)\,[(x \in F) \supset (x \in G)] \tag{20b}$$

Because of (2) this relation is identical with the general implication (3, § 6). 
An illustration of the resulting relation between the classes F and G is pre¬ 
sented in figure 3. The class F is included in the class G; this relation is called 
class inclusion. For it the symbolization 

$$F \subset G \tag{21}$$

is used. It should be noted that (21) is a proposition, not a class, since it
means the same as (20a) or (20b). For this reason the usual implication sign,
which would establish a class, cannot be used. Its place is taken by the 
reversed implication sign, which, when placed between capital letters, has 
the same meaning as the usual implication sign between small letters. 

The class F included in G is also called a subclass of G. Since (21) is also 
true when we put F for G, every class is its own subclass. This terminology
is inescapable because every proposition implies itself. Examples of class 
inclusion are found in biological classifications: the class of lions is a subclass 
of the class of mammals. This notation, now generally accepted in symbolic 
logic, distinguishes clearly between the relations of class inclusion and class 
membership, which are confused in traditional logic. 

The class of all things is called the universal class and is denoted by V. Its
complement, the class that contains nothing and thus is empty, is the null
class, denoted by $\Lambda$. The null class is a subclass of every class; this follows
because a false proposition implies every proposition.

Two classes F and G are called identical when the relation 

$$(x)\,[x \in F \equiv x \in G] \tag{22}$$

holds. We express the identity of classes by the symbol

$$F = G \tag{23}$$

which is defined by (22). Like (21), the expression (23) is a proposition, not 
a class. The symbols for class inclusion and class identity enable us to write 
propositions in class notation. Thus for (21) we can write 

$$(F \supset G) = V \tag{24}$$

The following relation is a tautology:

$$(F \vee \bar{F}) = V \tag{25}$$
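
The relations (21) to (25) can be verified on a small finite universe. The sketch below assumes an invented universal class of eight elements; the classes named `lions` and `mammals` are arbitrary sets of labels chosen to echo the biological example.

```python
V = frozenset(range(1, 9))   # universal class of a small, invented domain
NULL = frozenset()           # null class

def complement(F):
    return V - F

def implication_class(F, G):
    return complement(F) | G              # definition (19)

def included(F, G):
    return implication_class(F, G) == V   # class inclusion written as in (24)

lions = frozenset({1, 2})
mammals = frozenset({1, 2, 3, 4})

print(included(lions, mammals))          # the subclass relation holds
print(included(mammals, lions))          # the converse does not
print(included(lions, lions))            # every class is its own subclass
print(included(NULL, lions))             # the null class is a subclass of every class
print((lions | complement(lions)) == V)  # relation (25)
```

For finite sets, `included(F, G)` agrees with Python's built-in subset test `F <= G`, which is a convenient cross-check.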





Classes were introduced as extensions of propositional functions. It is often 
convenient to regard propositions as degenerate cases of functions, and to 
speak of the extension of a proposition. In correspondence with the relations 
(16 and 17, § 6), the universal class will be regarded as the extension of a
true proposition, the null class as the extension of a false proposition. 

For class identity, or identity of extension, it is sufficient that (22) is true. 
In order that the meanings of the symbols F and G be identical (identity of 
intension), it is required, moreover, that (22) be either analytic or contain a 
synthetic connective equivalence. 

§ 8. Axiomatic Systems 

Whereas the formulas of logic are tautologies, science is constructed from 
synthetic statements. The aim of the scientist is to make, not empty state¬ 
ments, but statements that inform us about the physical world; and the 
assertion that certain synthetic statements are true is the very task of science. 
The proof of the truth is based on experience and observation, sources of 
knowledge that play no part in logic. A mere collection of true synthetic 
statements, however, would not be called science; they must be ordered 
logically so as to form a deductive system. Some synthetic statements, called 
axioms, are placed at the top, and all other statements are derived from them 
by logical methods. The ideal form of a science is therefore the axiomatic 
system, an aim that individual sciences have attained more or less successfully.

Derivations from synthetic premises are made by the same methods that 
are used for the derivation of tautologies. For the purpose of such derivations 
it is always permissible to add tautologies to synthetic premises; as tautologies 
are empty, their addition does not enlarge the empirical content of the axioms 
and thus cannot adulterate it. The process of derivation employs the rules 
of substitution, inference, replacement, free variables, and so on, that were 
presented above. By means of a logical device it is possible to reduce a deriva¬ 
tion from synthetic axioms to a derivation of tautologies. For this purpose 
the conjunction of synthetic axioms is conceived as a statement a. If b is
the theorem to be derived, the implication $a \supset b$ is then tautological.¹ The
derivation of synthetic statements from synthetic axioms can then be replaced
by the derivation of tautologies of the particular form $a \supset b$, where a
represents the given set of axioms.
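
The reduction can be illustrated in the calculus of propositions by a truth-table test. The sketch below assumes two invented axioms over the propositional variables p, q, r; it shows that the axioms themselves are not tautologies, while the implication from their conjunction a to a derived theorem b is one.

```python
from itertools import product

def implies(a, b):
    return (not a) or b

def is_tautology(formula, n_vars):
    # True if the formula holds for every assignment of truth values.
    return all(formula(*values) for values in product([False, True], repeat=n_vars))

# Hypothetical synthetic axioms; a is their conjunction.
def a(p, q, r):
    return implies(p, q) and implies(q, r)

# Theorem to be derived from the axioms.
def b(p, q, r):
    return implies(p, r)

print(is_tautology(a, 3))                                            # False
print(is_tautology(lambda p, q, r: implies(a(p, q, r), b(p, q, r)), 3))  # True
```

The check is feasible here only because the example contains no variables other than propositional ones; for axioms with functional variables a truth table is not available.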

This conception of a derivation from synthetic premises shows clearly
that the contribution of logic to science consists in the construction of tautologies,
namely, of formulas of the form $a \supset b$; the truth of the axioms a,
however, is not a matter of logic. The contribution of mathematics to science
is to be interpreted in the same way. Mathematics, too, represents a method
of deriving consequences from given sets of axioms, whereas the truth of the
axioms is irrelevant for mathematics. Thus mathematics is concerned only
with tautological implications. I therefore agree with Russell in regarding
mathematics as a branch of logic, a branch of a complicated structure that
has grown from the treatment of axiom systems of practical significance.

¹ The only precaution to be taken is that a must not contain free variables; the expression
a must therefore be closed by means of binding the variables through all-operators. See ESL,
pp. 105, 144, 237.

Strangely enough, the assertion that mathematics deals only with tau¬ 
tologies has been regarded by some mathematicians as a disparagement of 
mathematical investigation. Such a judgment, of course, results from a mis¬ 
understanding, for the given definition of mathematics in no way diminishes 
the value of mathematical inquiry. The construction of complicated tau¬ 
tologies, as expressed in mathematical theorems, rather demands an extraordi¬ 
nary display of ingenuity and sagacity. Even if the tautology itself does not 
assert anything, the statement that a certain complicated symbol combination 
is a tautology can represent a discovery of the highest value. 

The given definition of mathematics determines the construction of the 
calculus of probability, which will be given in axiomatic form. Disregarding 
the question of the truth of the axioms, I shall first present the conclusions
that can be inferred from them. In fact, the mathematical calculus of prob¬ 
ability is nothing but the deductive system that deals with the relations 
between axioms and theorems. The truth of the axioms will be discussed later. 

The axiom systems of mathematics, apart from being synthetic, differ 
from logical formulas in containing certain symbols that are not used in the 
purely logical calculus and for which a meaning is not specified. Like the 
truth of the axioms, the meaning of the unspecified symbols is not determined 
by logic or mathematics: the logical relations of an axiomatic system can be 
developed without anything being known about the meaning of the symbols. 
Thus the axioms of geometry contain symbols for the concepts “point”, 
“straight line”, “plane”, and so on; it is not necessary, however, to make 
use of the meaning of these symbols for the derivation of geometrical theorems. 
The derivation can be carried through in the formal conception of the axiomatic 
system. The formal conception differs from the one employed in § 5 in that 
the meanings of the logical symbols are regarded as known; only the non- 
logical, unspecified symbols are used formally, i.e., without reference to their 
meanings. 

In order to proceed to an interpreted conception, we assign meanings to 
the unspecified symbols. Thus the symbol “point”, used in the axiomatic 
system of geometry, is taken to mean a small piece of matter; “straight line”, 
a light ray, and so on. Since the meanings are not determined by the axiomatic 
system, they are arbitrary to a certain degree. In fact, the unspecified sym¬ 
bols may be used in different meanings, that is, may be coordinated to various 
kinds of objects. We therefore speak of different interpretations of the unspecified
symbols. For instance, “point” may mean a number triplet; “straight
line”, a pair of linear equations; “plane”, a linear equation. The interpre¬ 
tation then differs from the usual interpretation of the axiomatic system and 
results in analytic geometry. The rules that coordinate an interpretation to 
the unspecified symbols are called coordinative definitions. 

The interpretation, however, is not completely arbitrary. The objects 
coordinated to the unspecified symbols must satisfy the relations postulated 
in the axioms; only if this condition is fulfilled is an admissible interpretation 
constructed. If the coordinated objects are of an empirical nature, for in¬ 
stance, pieces of matter and light rays, the proof that the interpretation is 
admissible is given by reference to physical laws. If the coordinated objects 
are mathematical constructions, for instance, number triplets and equations, 
the admissibility of the interpretation is proved by mathematical laws. The 
plurality of interpretations explains why the same axiomatic system can be 
applied to various subjects. All these subject matters are, however, isomorphous,
that is, they have the same logical structure, a consequence expressed
in the fact that they constitute admissible interpretations of the same axio¬ 
matic system. 
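
The test of admissibility can be sketched for a single toy axiom. The example assumes the axiom "every two distinct points lie together on exactly one line", with "point", "line", and "lies on" as the unspecified symbols; it is not the axiom system of Euclidean geometry, and the plane coordinates are a simplification of the number triplets mentioned above.

```python
from itertools import combinations

def satisfies_axiom(points, lines, lies_on):
    """True if every two distinct points lie together on exactly one line."""
    return all(
        sum(1 for line in lines if lies_on(p, line) and lies_on(q, line)) == 1
        for p, q in combinations(points, 2)
    )

# Interpretation 1: "points" are labels, "lines" are sets of two labels.
points_1 = ["a", "b", "c"]
lines_1 = [{"a", "b"}, {"a", "c"}, {"b", "c"}]
on_1 = lambda p, line: p in line

# Interpretation 2: "points" are number pairs, "lines" are linear equations
# a*x + b*y = c, stored as coefficient triples (a, b, c).
points_2 = [(0, 0), (1, 0), (0, 1)]
lines_2 = [(0, 1, 0), (1, 0, 0), (1, 1, 1)]
on_2 = lambda p, line: line[0] * p[0] + line[1] * p[1] == line[2]

print(satisfies_axiom(points_1, lines_1, on_1))      # admissible
print(satisfies_axiom(points_2, lines_2, on_2))      # admissible
print(satisfies_axiom(points_1, lines_1[:2], on_1))  # not admissible
```

The first two interpretations coordinate quite different objects to the same symbols and yet both satisfy the axiom; the third fails because two of its points share no line.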

Only when the unspecified symbols are interpreted can the axioms be 
called true or false. So long as no interpretation is given, the axioms are 
neither true nor false, but represent definitions of the unspecified symbols. 
However, since the meaning of the latter is not completely determined by 
the axioms, the definitions merely set up certain relations that are to hold 
between the unspecified symbols; they define only certain structural prop¬ 
erties for these symbols. Such definitions are called implicit definitions. They 
cannot be written in the form of explicit definitions (see p. 19); for implicit 
definitions a separation of the definiendum from a definiens is not possible. 

The incomplete character of implicit definitions, so far as the meaning of 
the unspecified symbols is concerned, is demonstrated by the plurality of 
interpretations. When implicit definitions include more than one unspecified 
symbol, a further peculiarity results, since the question whether a certain 
physical object may be regarded as an interpretation of one of the symbols 
cannot be answered until the interpretations of the other unspecified symbols 
are added. If, for instance, “points” are taken to mean small objects, “straight 
lines” to mean light rays, and so on, a system of objects is obtained that fits 
the implicit definitions of these concepts in the geometrical axiom system. 
If, however, “straight lines” are understood to be a pair of linear equations,
this interpretation cannot be combined with the coordinative definition “small 
object” for point. Implicit definitions are, therefore, of a logical nature very 
different from that of explicit definitions. They are, so to say, hollow forms 
that can be filled with different materials. 

The coordination of an interpretation to an axiomatic system is also called
the construction of a model. Thus the system of formulas of analytic geometry
represents a model of the axiomatic system of Euclidean geometry. Geome¬ 
tricians have also developed models of non-Euclidean systems of axioms, as, 
for instance, the so-called Klein model of the Bolyai-Lobachevski geometry. 

After the construction of an axiomatic system, the question of its con¬ 
sistency must be considered. If it is possible to coordinate to an axiomatic 
system another system in the sense of an interpretation, the consistency of 
the first system is reduced to that of the second. Thus the use of analytic 
geometry as a model of Euclidean geometry proves that the latter is consistent 
in the same sense as arithmetic, since analytic geometry represents a part of 
arithmetic. In other words, if the consistency of arithmetic is established, 
the consistency of Euclidean geometry is established likewise. Such reduc¬ 
tions have great practical significance, but their ultimate value depends on 
the reasons that can be adduced for the consistency of certain of the simplest 
systems of axioms, such as the systems of logic and arithmetic. A discussion 
of these problems, however, is beyond the province of this book. Suffice it 
to say that the consistency of the calculus of propositions and of the simple 
calculus of functions can be proved, whereas the consistency of the higher 
calculus of functions, and with it that of mathematics, is still an unsolved 
problem.

§ 9. The Probability Implication

The investigation of the concept of probability begins with an analysis of 
the logical structure of probability statements. The problem, which so far 
has not been given sufficient attention in the mathematical calculus of prob¬ 
ability, is amenable to precise solution with symbolic methods. Symbolic 
logic has devised means of characterizing the logical form of a statement 
without regard to its content; these methods can be extended to include a 
characterization of probability statements. The formalization of the prob¬ 
ability statement, in fact, is one of the first objectives in the philosophy of 
probability. 

To consider a typical probability statement: when a die is thrown, the 
appearance of face 1 is to be expected with the probability 1/6. This statement
has the logical form of a relation. It is not asserted unconditionally that face 1
will appear with the probability 1/6; the assertion, rather, is subject to the
condition that the die be thrown. If it is thrown, the occurrence of face 1
is to be expected with the probability 1/6; this is the form in which the probability
statement is asserted. No one would say that the probability of finding
a die on the table with face 1 up has the value 1/6, if the die had not been
thrown. Probability statements therefore have the character of an implica¬ 
tion; they contain a first term and a second term, and the relation of prob¬ 
ability is asserted to hold between these terms. This relation may be called 
probability implication. It is represented by the symbol

$$\underset{p}{\ni}$$

This is the only new symbol that the probability calculus adds to the symbols 
of the calculus of logic. Its connection with logical implication is indicated 
by the form of the symbol: a bar is drawn across the sign of logical implica¬ 
tion. Whereas the logical implication corresponds to statements of the kind, 
“If a is true, then b is true”, the probability implication expresses statements 
of the kind, “If a is true, then b is probable to the degree p”. 

The terms between which the probability implication holds are usually 
events. Let x be the event, “The die is thrown”, and y the event, “The die 
has come to rest on the table”; then a probability implication between the 
two events is asserted. We recognize at once that this requires a more exact 
formulation. We speak of a definite probability only when the event is characterized
in a certain manner, namely, as an event y in which face 1 is up.
This means that the event y is regarded as belonging to a certain class B.
We are dealing with a class, since the individual features of the event y are
disregarded in the statement. It does not matter on what part of the table
the die lies, or in which direction its edges point; only the attribute of having
face 1 up is considered. Thus the event y is characterized only as to whether
it can be said to belong to the class B. The same applies to the event x, since
we do not consider with what force the die is thrown or what angular momentum
is imparted to it; we demand only that x be a throw of the die, that it
belong to a certain class A. Therefore we write the probability statement
in the form

$$x \in A \underset{p}{\ni} y \in B \tag{1}$$

This formulation, however, requires modification. We must express the 
fact that the elements of the classes are given in a certain order, for instance, 
in the order of time. In other words, the event x belongs to the discrete 
sequence of the events $x_1, x_2, \ldots, x_i, \ldots$, while at the same time the event
y belongs to a corresponding sequence $y_1, y_2, \ldots, y_i, \ldots$ There is a one-one
correspondence between the elements of the two sequences, expressed by
equality of subscripts, and we assert only a probability implication between
the corresponding elements $x_i$, $y_i$, so that we write, instead of (1),

$$x_i \in A \underset{p}{\ni} y_i \in B \tag{2}$$


The coordination of the event sequences is necessary for the following reason. 
We do not wish to say that the probability implication holds, for instance, 
between the event $x_i$ of throwing the die and the event $y_{i+1}$ of obtaining a
certain result. When we merely state that the event $x_i$ of throwing the die
occurs, we have not yet asserted that the event $x_{i+1}$ of throwing the die will
also occur and that, therefore, a probability for the occurrence of the event
$y_{i+1}$ exists.

However, even (2) does not completely represent the form of the probability
statement; we must add the assertion that the same probability implication
holds for each pair $x_i$, $y_i$. This generalization is expressible by two
all-operators, meaning, “for all $x_i$ and for all $y_i$”. Using an abbreviation,
we can reduce the two all-operators to one by placing only the subscript i
in the parentheses of the operator. Thus the probability statement is written

$$(i)\,(x_i \in A \underset{p}{\ni} y_i \in B) \tag{3}$$


This expression represents the final form of the probability statement: The
probability statement is a general implication between statements concerning a
class membership of the elements of certain given sequences.

To illustrate this formulation of the probability statement: a relation of
the kind described is employed in dealing with the probability of a case of
influenza leading to death. We do not speak unconditionally of the probability
of the death of the patient, but only of the probability resulting from
the fact that he has contracted influenza. Here again are two classes, the
class of influenza cases and the class of fatal cases, and the probability
implication is asserted to hold between them. If $x_i$ is interpreted as a result
of medical diagnosis, A as an influenza case, $y_i$ as the state of the patient
after one week of illness, and B as the death of the patient, then this example
of a probability statement from the field of medicine has the form (3).

Another example is the probability of hitting a target during a rifle match. 
Here $x_i$ represents the single shot, $y_i$ the hit scored at the target, B the class
of hits within a certain range, and A the class to which the rifleman belongs 
according to his ability. The probability of a hit will be different according 
to the contestant’s degree of skill. Here again the probability is determined 
only when the classes A and B are chosen. 

An example from physics is the bombardment of nitrogen by α-rays,
or helium nuclei. There is a certain probability that a helium nucleus will
eject a hydrogen nucleus from the nitrogen atom. Let A represent the class
of α-rays, $x_i$ the hit of an individual helium nucleus, and $y_i$ the event
produced by it. The event results in the occasional emission of a hydrogen
nucleus, that is, it belongs to the class B. Although it is not possible to observe
directly the causal connection between the helium nucleus and the released
hydrogen nucleus, we assume, nevertheless, a one-one correspondence between
$x_i$ and $y_i$. Using a very weak radioactive preparation that rarely emits helium
nuclei, we can employ the temporal coincidence observed for the α-rays
and the hydrogen rays as a criterion of the correspondence.

In the previous examples, $x_i$ and $y_i$ stand in the relation of cause to effect,
but other instances can easily be found in which $y_i$ represents the cause and
$x_i$ the effect. In this case we carry out a reverse inference, from the effect to
the probability of a certain cause, for example, in investigating the cause of
a cold. And there are other examples for which the relation $x_i$ to $y_i$ is of a still
different type. There exists a probability that a certain position of the barome¬ 
ter indicates rain, but there is no direct causal connection between the two 
events. In other words, one is not the cause of the other. Rather, the two 
events are effects produced by a common cause, which leads to their concat¬ 
enation in terms of probabilities. It is easily seen that these examples also 
conform to the logical structure of (3). 
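
The role of the subscript in form (3) can be sketched with two recorded sequences. The text has not yet said how the degree p is determined; purely for illustration, the sketch assumes that p is read as a relative frequency in a finite record. The helper `relative_frequency` and the data are hypothetical.

```python
from fractions import Fraction

def relative_frequency(xs, ys, in_A, in_B):
    """Among the pairs (x_i, y_i) with x_i in A, the fraction with y_i in B."""
    # zip pairs elements of equal subscript only: x_i with y_i, never with y_{i+1}.
    paired = [(x, y) for x, y in zip(xs, ys) if in_A(x)]
    if not paired:
        raise ValueError("no element of the x-sequence belongs to A")
    hits = sum(1 for _, y in paired if in_B(y))
    return Fraction(hits, len(paired))

# Invented record: x_i is the i-th event, y_i the face shown after it.
xs = ["throw", "throw", "shake", "throw", "throw", "throw", "throw"]
ys = [1, 4, 1, 6, 1, 2, 3]

is_throw = lambda x: x == "throw"   # class A
shows_one = lambda y: y == 1        # class B

print(relative_frequency(xs, ys, is_throw, shows_one))  # 1/3
```

The third event does not belong to A and is therefore disregarded, although its partner in the second sequence happens to belong to B.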

The analysis presented shows that the probability implication can be re¬ 
garded as a relation between classes. The class A will be called the reference 
class; the class B, the attribute class. It is the probability of the attribute B
that is considered with reference to A. It must be added, however, that the
probability relation between the two classes A and B is determined only
after the elements of the classes are put into a one-one correspondence and
ordered in sequences. For instance, the probability implication holding between
the birth and the subsequent death of an infant (the rate of infant
mortality) differs from one country to another, that is, it differs according
to the sequence of events for which the statistics are tabulated. Even for an
individual die there exists a particular pair $x_i$, $y_i$ of sequences, and it is an
assertion derived from experience that the probability remains the same for 
different dice. Therefore, strictly speaking, the probability implication must 
be regarded as a three-term relation between two classes and a sequence pair. 
The pair of sequences provides the domain with respect to which the prob¬ 
ability implication assumes a determinate degree. Later the conception is 
extended to combinations of more than two sequences. The significance of 
the order of sequences is the subject of chapter 4. 
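
That the sequence pair enters as a third term can be shown by keeping the classes A and B fixed and changing only the record. The figures below are invented, and the finite relative frequency is again only an assumed stand-in for the degree of probability.

```python
from fractions import Fraction

def frequency(pairs, in_A, in_B):
    # Fraction of pairs (x_i, y_i) with y_i in B among those with x_i in A.
    selected = [y for x, y in pairs if in_A(x)]
    return Fraction(sum(1 for y in selected if in_B(y)), len(selected))

is_birth = lambda x: x == "birth"   # reference class A
is_death = lambda y: y == "died"    # attribute class B

# Two hypothetical sequence pairs for the same classes A and B.
country_1 = [("birth", "survived")] * 9 + [("birth", "died")]
country_2 = [("birth", "survived")] * 19 + [("birth", "died")]

print(frequency(country_1, is_birth, is_death))  # 1/10
print(frequency(country_2, is_birth, is_death))  # 1/20
```

The two results differ although the same functions define A and B in both cases; only the sequence pair has changed.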

Because of the equivalence that exists between classes and propositional 
functions, formula (3) may be expressed in a somewhat different way. According
to (2, § 7), we may use instead of the statement $x \in A$ the corresponding
propositional functional f(x) and, similarly, instead of $y \in B$, the
corresponding propositional functional g(y). Then we must express the one-one
correspondence between the sequences of x and y by a one-one functional
e(x,y) in order to determine for each x the corresponding value y. Thus (3)
assumes the form

$$(x)(y)\,[f(x) . e(x,y) \underset{p}{\ni} g(y)] \tag{4}$$

In this form it is not necessary to employ the subscript i, if the order of the 
elements is regarded as understood. 

A special kind of probability implication is included in the general form 
(3) or (4). It may happen that the sequences coincide and that the elements 
$x_i$ and $y_i$ are identical. The function e(x,y) then reduces to the identity
relation. We thus obtain, instead of (3) and (4),

$$(i)\,(x_i \in B \underset{p}{\ni} x_i \in B_k) \tag{5}$$

$$(x)\,[f(x) \underset{p}{\ni} g(x)] \tag{6}$$


Since it refers to a probability implication within the same sequence, this 
form will be called an internal probability implication. It is employed in many
important problems of probability, particularly in social statistics. Examples
are the probability that an inhabitant of Bavaria suffers from goiter, or that
a new-born baby is a boy. In such cases $x_i$ is not represented by an event
but by a person or an object that may possess the two properties B and
$B_k$ simultaneously. In more strictly statistical applications, the internal form
of the probability statement prevails to so high a degree that it is usually
made the basis of the probability calculus. Yet it would not be advisable to 
restrict the probability statement to this special form, since there are numerous 
other cases in which the more general types (3) or (4) are used. In particular, 
the application of the probability concept to the causal connection of events 
would be impossible if it were not based on the more general form of the 
probability statement as given above. 
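
In the internal form a single sequence carries both properties, so that no coupling of two sequences is required. The sketch assumes an invented list of persons and, for illustration only, a finite relative frequency in place of the degree of probability.

```python
from fractions import Fraction

# Invented sequence of persons x_i, each recorded with two properties.
persons = [
    {"region": "Bavaria", "goiter": True},
    {"region": "Bavaria", "goiter": False},
    {"region": "Saxony", "goiter": False},
    {"region": "Bavaria", "goiter": False},
    {"region": "Bavaria", "goiter": False},
]

in_reference = lambda x: x["region"] == "Bavaria"   # first property
in_attribute = lambda x: x["goiter"]                # second property

# Both conditions are applied to the same element x_i.
reference = [x for x in persons if in_reference(x)]
p = Fraction(sum(1 for x in reference if in_attribute(x)), len(reference))
print(p)  # 1/4
```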

§ 10. The Abbreviated Notation 

The form of the probability statement as given in (3, § 9) is rather compli¬ 
cated. An abbreviated notation, therefore, will be used for the development 
of the calculus of probability. Abbreviation is possible because certain prop¬ 
erties of formula (3, § 9) occur in all probability statements in a similar man¬ 
ner, and can be suppressed in a simplified notation. 

The probability statement has been written, so far, 

$$(i)\,(x_i \in A \underset{p}{\ni} y_i \in B) \tag{1}$$

This formula will be abbreviated to the form

$$(A \underset{p}{\ni} B) \tag{2}$$

The transition from the abbreviated to the detailed notation is controlled by 
the following rule: 

Rule of translation. For every capital letter K substitute the expression
$x_i \in K$, using for different capital letters different variables $x_i$, $y_i$ with the
subscript i, but the same variable $x_i$ for the capital letters $K_1, K_2, \ldots$. In front
of all parentheses containing capital letters place the symbol i within an all-operator.
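
The rule of translation is mechanical enough to be sketched as a string rewriting. The function below is a hypothetical helper, not part of the calculus: it assumes that the input is a single parenthesized statement, that class subscripts are written as `B_1`, `B_2`, and that `->p` serves as a plain-text stand-in for the sign of the probability implication.

```python
import re

def translate(abbreviated):
    """Expand an abbreviated statement such as "(A ->p B)" into detailed notation."""
    variables = iter("xyzuvw")   # one variable per distinct capital letter
    assigned = {}

    def expand(match):
        letter = match.group(1)  # capital letter without its subscript
        name = match.group(0)    # capital letter with optional subscript
        if letter not in assigned:
            assigned[letter] = next(variables)
        return f"{assigned[letter]}_i ∈ {name}"

    body = re.sub(r"([A-Z])(?:_\d+)?", expand, abbreviated)
    return "(i)" + body          # the all-operator in front of the parentheses

print(translate("(A ->p B)"))      # (i)(x_i ∈ A ->p y_i ∈ B)
print(translate("(B_1 ->p B_2)"))  # (i)(x_i ∈ B_1 ->p x_i ∈ B_2)
```

Different capital letters receive different variables, whereas equal capital letters with different subscripts share the variable $x_i$, so that the second example yields an internal probability implication.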

The method of abbreviation, as is seen from the rule, amounts to leaving 
out the specification of the sequence pair, an omission that is permissible 
because in probability statements the elements of the sequence pair never 
occur as free, but always as bound, variables. In the abbreviated notation, 
parentheses play the part of the all-operator; therefore, brackets must be 
used if generalization is not to be indicated. Furthermore, the difference 
between the two kinds of negation that exist for general statements is ex¬ 
pressed as follows: in one case the negation bar is placed only above the 
expression written within parentheses; in the other it is extended above the 
parentheses. We thus define 

$$(\overline{A \underset{p}{\ni} B}) \;=_{Df}\; (i)\,\overline{(x_i \in A \underset{p}{\ni} y_i \in B)} \tag{3}$$

$$\overline{(A \underset{p}{\ni} B)} \;=_{Df}\; \overline{(i)\,(x_i \in A \underset{p}{\ni} y_i \in B)} \tag{4}$$

The use of parentheses for the expression of the generalization applies also
to formulas not containing the sign of the probability implication, and allows
us to go from a class to a statement. Thus, $A \supset B$ is a class, and $(A \supset B)$
is a statement; according to the rule of translation, this statement has the
form (20b, § 7) and is therefore identical with $A \subset B$. Adding parentheses
to a class symbol means, in this notation, that the class is identical with the
universal class and thus leads to the meaning expressed explicitly in (24, § 7).
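
The difference between the two kinds of negation can be checked on a finite case. The sketch uses the logical implication instead of the probability implication, since the parentheses have the same generalizing function for both; the sequences and classes are invented.

```python
# Hypothetical paired sequences and classes.
xs = [1, 2, 3, 4]
ys = [10, 20, 30, 40]
A = {1, 2, 3}
B = {10, 20}

def implies(a, b):
    return (not a) or b

# One truth value per couple x_i, y_i: (x_i in A) implies (y_i in B).
cases = [implies(x in A, y in B) for x, y in zip(xs, ys)]

statement = all(cases)                      # parentheses as all-operator
inner_negation = all(not c for c in cases)  # bar inside the parentheses
outer_negation = not all(cases)             # bar extended above the parentheses

print(statement, inner_negation, outer_negation)  # False False True
```

The general statement fails because of the third couple, so that its outer negation is true; the inner negation is false because the implication does hold for the other couples.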

If compound classes are used, like the class $A \supset B$, the rule of translation
leads to the simple result: different capital letters mean narrower couple 
classes; equal capital letters with different subscripts mean simple classes. 
Couple classes containing implication or equivalence signs are interpreted 
by analogy with (13 and 14, § 7). The subscripts headed by circumflexes 
are dispensable for couple classes because their function is taken over by 
the difference of the capital letters. Class inclusion for different capital letters, 
i.e., for narrower couple classes, means a relation similar to the one illustrated
in figure 3, § 7, for which the two circles are drawn in different planes,
one on top of the other; corresponding points represent the couples of elements.
Since for all practical purposes the narrower couple classes behave
like simple classes, it is permissible to forget about the distinction for technical
manipulations. The treatment of the general probability implication is
technically not different from that of the internal probability implication. 

A further abbreviation may be introduced. For many applications, particularly
in mathematical calculations, we must solve the probability implication
(2) for the degree p. We denote the degree p by P(A,B), reading this
symbol as “the probability from A to B”. Some writers call this “the relative
probability of B with respect to A”. But in the present notation, the natural
order, from the known to the unknown element of the relation, is used, thus
introducing the same order of terms that is used in the implication $a \supset b$.
The expression “probability from A to B” has the same grammatical form 
as the geometrical expression “distance from A to B ”, which also designates 
a relation. The order shows clearly that probabilities are treated as relations, 
in correspondence with the definition given in § 9. The calculus of probability 
in its usual form includes absolute as well as relative probabilities. The word 
“absolute” must be interpreted merely as an abbreviated notation, applying 
when the first term, the reference class, is dropped as being understood. 
Thus when it is said that there is the absolute probability 1/6 for a face of the
die, it is understood that the reference class is represented by the throwing 
of the die. This suppression of a first term has led to some confusion. 

Instead of (2), then, the equation is written 


P(A,B) = p 
The P-symbol is a numerical functor, that is, a functional variable the special
values of which are numbers.¹ It leads to statements only when it is used
within mathematical equations. The P-symbol need not be considered as a
primitive symbol; it can be reduced to the symbol of the probability implication
by the definition

$$[P(A,B) = p] =_{Df} (A \underset{p}{\supset\!\!\!\!-} B) \tag{5}$$

The symbol P(A,B) itself is not defined; only the expression P(A,B) = p is.
This is permissible since the symbol P(A,B) never occurs alone, but only in
such equations. Thus a mere definition in use is given for P(A,B). The equality
sign used with this symbol represents arithmetical equality, i.e., equality
between numbers. In the foregoing account of symbolic logic the sign was
not explained because the rather complicated connection between logic and
arithmetic could not be demonstrated. It may suffice to say that mathematical
equality can be reduced to the basic logical operations.² The negation of a
statement of mathematical equality is denoted by the inequality sign $\neq$.
The notation by means of the P-symbol is called mathematical notation; that
in terms of the $\supset\!\!\!\!-$-symbol, implicational notation.

Another abbreviation is now introduced. Sometimes we omit the statement 
of the degree of probability and write 

$$(A \supset\!\!\!\!- B) \tag{6}$$

This relation is called indeterminate probability implication. Since it is not 
permissible simply to drop one constituent within a formula, a definition 
must be used to connect (6) with the symbols previously defined: 

$$(A \supset\!\!\!\!- B) =_{Df} (\exists p)\,(A \underset{p}{\supset\!\!\!\!-} B) \tag{7}$$


The abbreviation (6) therefore means, “There is a p such that there exists 
between A and B a determinate probability implication of the degree p”. 
Passing from (6) to the detailed notation we obtain, according to the rule
of translation,

$$(A \supset\!\!\!\!- B) =_{Df} (\exists p)\,(i)\,(x_i \in A \underset{p}{\supset\!\!\!\!-} y_i \in B) \tag{8}$$


The all-operator is placed after the existential operator, so that (8) represents 
the stronger form in the sense of (9, § 6). 

The value p is often written within separate parentheses behind the probability
implication:

$$(\exists q)\,(A \underset{q}{\supset\!\!\!\!-} B).(q = p) \tag{9}$$


¹ See ESL, p. 312.

² It is an identity of classes of a higher type. See ibid., § 44.

This is merely a more convenient way of writing and has the same meaning
as (2). We need this form because we shall later obtain for the probability
degree p expressions that are too involved to be written as subscripts of the
symbol of the probability implication. The resulting parentheses in the expression
(q = p) do not indicate an all-operator for the detailed notation
because they do not contain capital letters.

The abbreviations given in this section will be useful in the following presentation
of the theory of probability. In particular, it is an advantage that
even in the abbreviated notation the symbols of the propositional operations 
can be manipulated according to the rules of the propositional calculus, 
although these symbols are placed between class symbols (that is, between 
capital letters) and thus represent class operations. This is possible because 
of the isomorphism of the two calculi (see § 7). 

§11. The Rule of Existence 

The formal structure of probability statements has been explained, but nothing 
has been said so far about their meaning. The laws of the probability implication
can be completely developed, however, without interpretation. Discussion
of the problem of interpretation will be deferred to a later section.

As a consequence, a method cannot yet be provided whereby we can determine
whether, if two classes are given, a probability implication holds between
them; in other words, we cannot yet ascertain the existence of a probability 
implication. However, this impossibility need not disturb us at this point. 
We assume the existence of some probability implications to be given; and 
we deal only with the question of how to derive new probability implications 
from the given ones. This operation exhausts the purpose of the probability 
calculus. 

The existence of a probability implication I regard, in general, as a synthetic
statement that cannot be proved by the calculus. The calculus can
only transfer the existence character; with its help we can infer, from the
known existence of certain probability implications, the existence of new ones.
The property of transference by the calculus is, in part, directly expressed
by the form of the axioms; some of the axioms, such as III and IV, directly
assert the existence of new probability implications if certain others are given.
However, these particular cases of transference do not suffice; for the transfer 
property will be required in a more general manner, as will be seen later. We 
must be able to assert that whenever the numerical value of a probability 
implication is determined by given probability implications, this probability 
implication does exist. It will become obvious (§ 17) that this existence is 
not self-evident, but must be asserted separately. The following postulate is 
therefore introduced. 
Rule of existence. If the numerical value p of a probability implication
$(A \underset{p}{\supset\!\!\!\!-} B)$, provided the probability implication exists, is determined by given
probability implications according to the rules of the calculus, then this probability
implication $(A \underset{p}{\supset\!\!\!\!-} B)$ exists.

The rule of existence is not an axiom of the calculus; it is a rule formulated
in the metalanguage, analogous to the rule of inference or the rule of substitution
(see § 5). It must be given an interpretation even in the formal treatment
of the calculus. There must exist a formula that can be demonstrated
in the calculus and that expresses the probability under consideration as a
mathematical function of the given probabilities, with the qualification that
the function be unique and free from singularities for the numerical values
used. This is what is meant by the expression, “determined according to
the rules of the calculus”.¹

§ 12. The Axioms of Univocality and of Normalization 

From the discussion of the logical form we turn to the formulation of the 
laws of the probability implication. As explained above, an interpretation 
of probability is not required for this purpose. The laws will be formulated 
as a system of axioms for the probability implication—that is, as a system 
of logical formulas that, apart from logical symbols, contains only the symbol 
of the probability implication. Among the logical symbols, the logical implication
occurs, and is thus used in formulating the laws of the probability
implication. 

The system to be constructed is called the system of axioms of the probability
calculus. The name is justified by the fact that it is possible to derive
from these axioms the formulas that are actually used in all applications of
the probability calculus. When, at a later stage, an interpretation of probability
is presented by means of statements about statistical frequencies, it
will be possible to give another foundation to the axioms, showing that they
are derivable from the given interpretation of probability. For the present,
however, no use is made of the connection between probabilities and frequencies;
instead, the axiom system is regarded as a system of formulas by
which the properties of the probability concept are determined. By this
procedure the axiomatic system of the probability calculus assumes a function
comparable to that of the axiomatic system of geometry, which, in a
similar way, determines implicitly the properties of the basic concepts of 
geometry, that is, of the concepts “point”, “line”, “plane”, and so on (see § 8). 

¹ The rule of existence can be replaced only incompletely by axioms. See footnote, p. 61.
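We begin with the first two groups of axioms:

I. Univocality

$$(p \neq q) \supset [(A \underset{p}{\supset\!\!\!\!-} B).(A \underset{q}{\supset\!\!\!\!-} B) \equiv (\bar{A})]$$

II. Normalization

$$1.\quad (A \supset B) \supset (\exists p)\,(A \underset{p}{\supset\!\!\!\!-} B).(p = 1)$$

$$2.\quad \overline{(\bar{A})}.(A \underset{p}{\supset\!\!\!\!-} B) \supset (p \geq 0)$$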

Group II will be discussed first. The degree of probability is asserted by
II,2 to be a positive number, including 0 as an extreme case. That p cannot
be greater than 1 is not incorporated into the axioms because it will be derived
as a theorem in § 13. The normalization to values in the interval from
0 to 1, end points included, is restricted to the case where the class A is not
empty. The condition is expressed by the term $\overline{(\bar{A})}$, which means, according
to the rule of translation (see p. 49), $\overline{(i)\,(\overline{x_i \in A})}$, or, what is the same,
$(\exists i)\,(x_i \in A)$. The significance of this condition will be explained presently.

Axiom II,1 establishes a connection between the logical implication and
the probability implication. Whenever a logical implication exists between
A and B, there exists also a probability implication of the degree 1; the
converse does not hold, however. It follows from a simple consideration that
the reverse relation cannot be maintained. For the demonstration we use the
formula corresponding to II,1:

$$(A \supset \bar{B}) \supset (\exists p)\,(A \underset{p}{\supset\!\!\!\!-} B).(p = 0) \tag{1}$$

the necessity of which seems clear, though the exact derivation will be given 
later. 

Formula (1) states that whenever an impossibility exists, a probability 
implication of the degree 0 exists also. For this case it is easy to illustrate 
why the reverse condition cannot be required. For instance, if we prick a 
sheet of paper with a needle, the probability (at least for a mathematical 
idealization of the problem) of hitting a given point is equal to 0; nevertheless 
a certain point is hit each time. Thus the probability 0 does not entail impossibility.
Consequently, in order to remain free of contradictions, we must
assert that certainty does not follow from the probability 1. Rather, certainty
and the probability 1 stand in the relation of a narrower to a more comprehensive
concept; certainty is a special case of the probability 1 (see § 18).

The relation of the two concepts is thus made clear in a very simple manner;
the mysterious conception, which is occasionally voiced, that certainty
and the probability 1 are incomparable concepts is untenable. On the contrary,
the relation between the logical and the probability implication as
expressed by II,1 represents an important relation holding between the two
concepts, which connects the logic of the probability implication with classical
logic. At this point the axiom system of probability differs from that of 
geometry. The concepts “point”, “line”, “plane”, and so on, occurring in 
geometry, are of a type different from that of logical concepts; for that 
reason they can never assume the meaning of logical concepts, even for a 
special case. 

The formulation of the univocality axiom I is clarified by the preceding
remarks on the connection of the logical and the probability implication. It is
obvious that the univocality of the degree of probability must be demanded.
At first sight we might try to formulate univocality by

$$(A \underset{p}{\supset\!\!\!\!-} B).(A \underset{q}{\supset\!\!\!\!-} B) \supset (p = q) \tag{2}$$


However, this formula leads to contradictions. They result from the fact that
in II,1 the logical implication was considered to be a special case of the probability
implication. Certain properties of the logical implication prevent the
assertion of (2) with complete generality. This is due to an above-mentioned
property of the logical adjunctive implication, according to which a false
proposition implies any proposition. In logic this fact is expressed by the
reductio ad absurdum

$$(A \supset B).(A \supset \bar{B}) \equiv (\bar{A}) \tag{3}$$

Formula (3) is a generalization of (1g, § 4). It is proved by transforming the
left side of (3) by means of (6a, § 4), applying (4c, § 4) and using (5d and
5c, § 4). Addition of the parentheses, meaning extension to an all-statement,
is of course always permissible for tautologies. Logic thus admits an ambiguity
of logical implication, but this case is restricted to the condition $(\bar{A})$.
The ambiguity is transferred to the probability implication, since (3) with
II,1 and (1) lead to the relation

$$(\bar{A}) \supset (\exists p)\,(\exists q)\,(A \underset{p}{\supset\!\!\!\!-} B).(A \underset{q}{\supset\!\!\!\!-} B).(p = 1).(q = 0) \tag{4}$$


In case of $(\bar{A})$ being true, the right side of the formula is valid, in contradiction
to (2). Instead of (2) we therefore write axiom I, which brings the
ambiguity of the probability implication into a form analogous to the ambiguity
of logical implication. The condition $p \neq q$ must be written in front of I,
since the expressions in brackets, contrary to (3), do not show whether we
are dealing with different probability degrees.

In order to clarify I, it may be remarked that this axiom has the same
meaning as the following implications:

$$(A \underset{p}{\supset\!\!\!\!-} B).(A \underset{q}{\supset\!\!\!\!-} B).(p \neq q) \supset (\bar{A}) \tag{5}$$

$$(\bar{A}) \supset (A \underset{p}{\supset\!\!\!\!-} B).(A \underset{q}{\supset\!\!\!\!-} B) \tag{6}$$
These two formulas result when formula (7a, § 4) is used to dissolve the
equivalence in I into implications going in both directions. In this case the
expression $(p \neq q)$ is dropped at the left side of (6); the condition is redundant
because (6) holds likewise if the condition is not satisfied, that is, if
p = q. From (6) is derived

$$(\bar{A}) \supset (A \underset{p}{\supset\!\!\!\!-} B) \tag{7}$$

Since p can be chosen completely at random, the formula states that for the
case $(\bar{A})$ any degree of probability may be asserted to hold between A and B.
Formula (7) goes beyond (4) so far as it extends the ambiguity to any chosen
degree of probability, including even values greater than 1 or smaller than 0.¹

The ambiguity thus admitted is harmless because it applies only to the
case in which the first sequence does not contain a single element $x_i$ belonging
to the class A. This follows because, according to the translation rule,

$$(\bar{A}) =_{Df} (i)\,(\overline{x_i \in A}) \tag{8}$$


In the case $(\bar{A})$, therefore, the probability cannot be used to determine expectations
of the events B because the event A is never realized, and so the plurality
of values cannot lead to practical inconveniences. It seems reasonable,
in such a case, to consider the probability implication between A and B with
respect to the sequence pair $x_i y_i$ as not defined at all and, therefore, to allow
the assertion of any value for the degree of probability. This generalization
of the probability concept extends it beyond practical needs; the extension
is required because we wish to incorporate in the probability concept, as a
special case, the logical implication as it is formulated in symbolic logic. The
univocality, however, is always guaranteed if at least a single element $x_i$ of
the sequence belongs to the class A; it does not matter whether the corresponding
$y_i$ belongs to the class B. For, using the tautological equivalence
provided by the propositional calculus,

$$a.b \supset c \equiv \overline{a.b} \lor c \equiv \bar{a} \lor \bar{b} \lor c \equiv \bar{a} \lor c \lor \bar{b} \equiv \overline{a.\bar{c}} \lor \bar{b} \equiv a.\bar{c} \supset \bar{b} \tag{9}$$

and substituting

$$\text{for } a:\ (A \underset{p}{\supset\!\!\!\!-} B).(A \underset{q}{\supset\!\!\!\!-} B) \qquad \text{for } b:\ (p \neq q) \qquad \text{for } c:\ (\bar{A}) \tag{10}$$

we derive from (5) the formula

$$(A \underset{p}{\supset\!\!\!\!-} B).(A \underset{q}{\supset\!\!\!\!-} B).\overline{(\bar{A})} \supset (p = q) \tag{11}$$


¹ The latter extension is necessary because otherwise the system of axioms would lead to
contradictions, as J. C. C. McKinsey and S. C. Kleene have pointed out. See my note on
probability implication in Bull. Amer. Math. Soc., Vol. 47, No. 4 (1941), p. 265. It is for this
reason that in this article I introduced for axiom II,2 the condition $\overline{(\bar{A})}$, which the German
edition of this book does not contain.

When the double negation is removed and the translation rule (p. 49) and
formula (13, § 6) are applied, we obtain

$$(A \underset{p}{\supset\!\!\!\!-} B).(A \underset{q}{\supset\!\!\!\!-} B).(\exists i)\,(x_i \in A) \supset (p = q) \tag{12}$$

This means that the univocality of the degree of probability is guaranteed
if there is at least one element $x_i$ that belongs to the class A.
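
The tautological equivalence (9) used in this derivation can be checked mechanically by running through all truth values of a, b, c. The following is a minimal sketch in Python; the helper names `implies`, `chain`, and `chain_is_tautological` are illustrative and not part of the text.

```python
from itertools import product

def implies(x, y):
    """Adjunctive (material) implication: false only if x is true and y is false."""
    return (not x) or y

# One function per member of the chain in (9), read left to right.
chain = [
    lambda a, b, c: implies(a and b, c),              # a.b implies c
    lambda a, b, c: (not (a and b)) or c,             # not(a.b) or c
    lambda a, b, c: (not a) or (not b) or c,          # not-a or not-b or c
    lambda a, b, c: (not a) or c or (not b),          # not-a or c or not-b
    lambda a, b, c: (not (a and not c)) or (not b),   # not(a.not-c) or not-b
    lambda a, b, c: implies(a and not c, not b),      # a.not-c implies not-b
]

def chain_is_tautological():
    """True if all members of the chain agree for every assignment of a, b, c."""
    for a, b, c in product([False, True], repeat=3):
        if len({f(a, b, c) for f in chain}) != 1:
            return False
    return True

print(chain_is_tautological())
```

Since every member of the chain takes the same truth value under all eight assignments, the first and last members may be exchanged, which is the step that leads from (5) to (11).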

It is a result of the axioms I and II that the probability implication assumes
the function of an extension of logical implication, the general implication
introduced in (3, § 6). The latter is to be regarded as a special case of a probability
implication, as we may recognize particularly from the form (6, § 9).
This conception permits a more precise formulation of the concept of physical
law, which was interpreted above as a general implication (§ 6). Closer inspection
reveals that general implications that are absolutely certain can
occur only if they are tautologies. The uncertainty of synthetic implications
originates from the fact that any conceptual formulation of a physical event
represents an idealization; the application of the idealized concept can possess
only the character of probability (p. 8). The expression, “It follows according
to a physical law”, must therefore be represented, strictly speaking, not by
a general implication but by a probability implication of a high degree
(see § 85). Upon this fact rests the great importance of the probability implication:
all laws of nature are probability implications.

There is an important difference between logical implication and probability
implication. To the general implication $(A \supset B)$ corresponds an individual
implication $a \supset b$, as defined by the truth tables 1B (§ 4). For probability
implication such an individual relation is not used; the expression
$A \underset{p}{\supset\!\!\!\!-} B$, therefore, need not be considered as a meaningful expression. Only
in a fictitious sense can the degree of probability, holding for the entire
sequence, be transferred to the individual case. Like the meaning of an individual
connective implication of the synthetic kind (see § 6), that of an individual
probability implication is constructed by a transfer of meaning from
the general to the particular case. This transfer makes understandable why a
frequency interpretation of the degree of probability can be applied to single
events, though only in a fictitious sense. The problem will be considered
later (see § 72).

§ 13. The Theorem of Addition 

A well-known theorem of the probability calculus is that the probability of a
logical sum is determined by the arithmetical sum of the individual probabilities,
provided the events are mutually exclusive. For instance, the probability
of obtaining face 1 or 2 by throwing a die is calculated to be 1/6 + 1/6 = 1/3.
For the addition it is essential that only one of the two faces can lie on top;
otherwise this manner of calculating would be unjustified. The theorem is
usually called the theorem of addition, and it must now be formulated as an
axiom.

The condition of exclusion could be written in the form $(B \supset \bar{C})$, but it is
sufficient to use the weaker statement

$$(A.B \supset \bar{C}) \tag{1a}$$

which can be derived from $(B \supset \bar{C})$, whereas the latter formula is not derivable
from (1a). Although (1a) appears to be nonsymmetrical with respect to B
and C, this is actually not so; for, because of (6a and 5a, § 4), formula (1a)
is equivalent to

$$(A.C \supset \bar{B}) \tag{1b}$$

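By the use of (1a) the theorem of addition may be written as follows:

III. Theorem of addition

$$(A \underset{p}{\supset\!\!\!\!-} B).(A \underset{q}{\supset\!\!\!\!-} C).(A.B \supset \bar{C}) \supset (\exists r)\,(A \underset{r}{\supset\!\!\!\!-} B \lor C).(r = p + q)$$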


The addition theorem is a formula that expresses the transfer property of 
the calculus: it states a rule according to which the character of existence 
is transferred. It asserts the existence of the probability implication for the 
logical sum, if the individual probability implications are given. Nonetheless, 
we recognize the indispensability of the rule of existence (§ 11). For it is the 
existence rule that permits us to reverse the addition theorem; with its help 
we can derive the theorem 

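$$(A \underset{p}{\supset\!\!\!\!-} B).(A \underset{r}{\supset\!\!\!\!-} B \lor C).(A.B \supset \bar{C}) \supset (\exists q)\,(A \underset{q}{\supset\!\!\!\!-} C).(q = r - p) \tag{2}$$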


This theorem cannot be obtained from axiom III alone, since the latter asserts
existence only if the individual probabilities are given. The implicans of (2)
differs from that of the axiom in that it contains only one individual probability
and, moreover, the probability of the logical sum. Yet we recognize
that the degree q of the probability implication, stated on the right side of (2),
is determined by the addition theorem, provided this probability implication
exists. Because of the univocality axiom I, the probability q, if it exists, must
assume a value that, when added to p, furnishes the value r, that is, q = r − p.
Now we can apply the existence rule, and the existence of the probability
implication $(A \underset{q}{\supset\!\!\!\!-} C)$ can be asserted.

The form of the relation (2) makes it clear that axiom III can be only partially
reversed. The existence of the probability of the logical sum is not
sufficient for the reversal; one of the two individual probabilities must also
be given. Otherwise the degree of probability, q, would be undetermined, and
the existence rule would not be applicable. The restricting condition is necessary
because otherwise it would be possible to infer quite generally $(A \underset{q}{\supset\!\!\!\!-} C)$,
that is, the existence of a probability implication for any event. The tertium
non datur (1c, § 4) and the formula $(\exists r)\,(A \underset{r}{\supset\!\!\!\!-} C \lor \bar{C}).(r = 1)$, which is
obtained from it by the help of (8c, § 4) and axiom II,1, would give this result.
The unwarranted generalization is made impossible by the existence rule,
which demands that the probabilities under consideration be determined by
those given.

The idea expressed in (2) is of great importance in the logical construction 
of the probability calculus. It is the validity of reversed formulas like theorem 
(2) and thus of the existence rule upon which rests the possibility of operating 
with numerical values of probabilities according to the rules of algebra. 
When we no longer incorporate the condition of exclusion into the formula, 
stating it only in the context, we may write, introducing the P-notation, 

P(A,B V C) = P(A,B) + P(A,C) (3) 

With this way of writing we express the fact that the rules by which mathematical
equations are manipulated can be applied to probability formulas.
Thus it is permissible to proceed from (3) to the formula 

P(A,C) = P(A,B V C ) - P(A,B) (4) 

The admissibility of this step is expressed in theorem (2). We recognize that 
the mathematical symbolization of the probability calculus is made possible 
by a particular property of this calculus, a property that requires a special 
formulation. The property is expressed by the rule of existence in combination
with the axiom of univocality.
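
The addition theorem (3) and its reversed form (4) can be illustrated on a finite model. The sketch below is only an illustration: it borrows the reading of probabilities as relative frequencies, which the axiomatic treatment deliberately does not presuppose at this stage, and the sequence of throws, the function `probability`, and the class predicates are hypothetical.

```python
from fractions import Fraction

def probability(sequence, in_a, in_b):
    """Relative frequency of B among the members of A in a finite sequence.

    Returns None when A has no members, because the degree is then not
    univocally determined.
    """
    reference = [x for x in sequence if in_a(x)]
    if not reference:
        return None
    favorable = sum(1 for x in reference if in_b(x))
    return Fraction(favorable, len(reference))

# Hypothetical data: every face of the die turns up equally often.
throws = [1, 2, 3, 4, 5, 6] * 10

def any_throw(x): return True        # reference class A
def face_1(x): return x == 1         # class B
def face_2(x): return x == 2         # class C, which excludes B
def face_1_or_2(x): return face_1(x) or face_2(x)

p = probability(throws, any_throw, face_1)
q = probability(throws, any_throw, face_2)
r = probability(throws, any_throw, face_1_or_2)

print(r == p + q)   # addition theorem (3)
print(q == r - p)   # reversed form (4)

# Without exclusion the plain sum overshoots: "at most 2" and "even" share face 2.
def at_most_2(x): return x <= 2
def even(x): return x % 2 == 0

overlap_sum = probability(throws, any_throw, at_most_2) + probability(throws, any_throw, even)
overlap_or = probability(throws, any_throw, lambda x: at_most_2(x) or even(x))
print(overlap_or == overlap_sum)
```

The last comparison shows why the condition of exclusion has to accompany (3) and (4): for overlapping classes the probability of the logical sum is smaller than the arithmetical sum.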

Certain difficulties arise from the fact that we cannot incorporate into the 
mathematical symbolization the condition of exclusion, presupposed for (3) 
and (4), but must add it verbally. A formula that is not dependent on conditions
to be added in the context will be developed later (see § 20).

A remark must be made concerning the univocality of the P-symbol.
Since univocality of a probability P(A,B) is restricted to the case that A is
not empty, the P-symbol has only in this case the character of a numerical
functor, a number variable determined by the argument in parentheses. In
order to make equations like (3) hold also in the case of an empty class A,
the convention is introduced that such equations then represent existential
statements of the form, “There is a numerical value for the dependent probability
that satisfies the equation when the independent probabilities are
given”. For instance, (3) states for an empty class A that, if for P(A,B) and
P(A,C) any values are given, there is a probability value among those holding
for $P(A, B \lor C)$ that satisfies (3). All equations, in this case, will represent
trivial statements, because, if A is empty, a probability with A in the first
term will have all real numbers as its values; the existential statement will
therefore be trivially satisfied. The advantage of this convention is that it
allows us to drop, for probability equations, the condition stating that A is
not empty. The equations also hold in the contrary case, but then they say
nothing. For the implicational mode of writing, no such convention is needed,
since axiom III and formula (2) are existential statements and lead to univocal
values of the probabilities only if A is not empty. The convention as to the
P-symbol is therefore in agreement with the rule of translation (p. 49).
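
The behavior of the probability implication for an empty reference class can be made concrete in a toy model. The following sketch rests on assumptions that go beyond the text: it reads a probability implication over a finite sequence pair as a statement about relative frequency, and it models the empty case by letting every degree count as asserted, in the spirit of formula (7, § 12). The function `implication_holds` and the sample sequences are hypothetical.

```python
from fractions import Fraction

def implication_holds(xs, ys, in_a, in_b, p):
    """Toy reading of a probability implication of degree p from A to B.

    xs, ys form a finite sequence pair. If no x belongs to A, every degree p
    counts as asserted; otherwise only the actual relative frequency does.
    """
    reference = [y for x, y in zip(xs, ys) if in_a(x)]
    if not reference:
        return True  # empty reference class: any degree may be asserted
    frequency = Fraction(sum(1 for y in reference if in_b(y)), len(reference))
    return frequency == p

xs = [1, 2, 3, 4]
ys = [0, 1, 1, 0]

def is_even(x): return x % 2 == 0    # nonempty class A
def never(x): return False           # empty class A
def is_one(y): return y == 1         # class B

# Nonempty A: exactly one degree holds (univocality).
print(implication_holds(xs, ys, is_even, is_one, Fraction(1, 2)))
print(implication_holds(xs, ys, is_even, is_one, Fraction(1, 3)))

# Empty A: any degree holds, even values outside the interval from 0 to 1.
print(implication_holds(xs, ys, never, is_one, Fraction(5)))
print(implication_holds(xs, ys, never, is_one, Fraction(-2)))
```

In this model an equation between probabilities with an empty first term is satisfied by a suitable choice of values, which is why such equations hold trivially and say nothing.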

In the greater part of this book the mathematical notation will be employed. 
Except in this section and the next, the axioms formulated in the implicational
notation will no longer be used as a basis for further derivations. Their
place will be taken by theorems in the P-notation, derived from them. The
transition to the P-notation restricts the logical operations to the inner part
of the P-symbols. Supplementary remarks will be made in the context whenever
other restricting conditions, on which the validity of the formulas
depends, are added. 

We now derive a few theorems that have been used in the preceding section.
Because of the tertium non datur, the formula $(A \supset B \lor \bar{B})$ is always true,
and we obtain the general formula

$$(\exists r)\,(A \underset{r}{\supset\!\!\!\!-} B \lor \bar{B}).(r = 1) \tag{5}$$

or, in the P-notation,

$$P(A, B \lor \bar{B}) = 1 \tag{5'}$$

We may therefore add formula (5) to $(A \underset{p}{\supset\!\!\!\!-} B)$. The conditions of theorem (2)
are satisfied if we substitute $\bar{B}$ for C, since $(A.B \supset \bar{\bar{B}})$ also is always valid.
We thus obtain the theorem

$$(A \underset{p}{\supset\!\!\!\!-} B) \supset (\exists u)\,(A \underset{u}{\supset\!\!\!\!-} \bar{B}).(u = 1 - p) \tag{6}$$

In the P-notation the theorem is written

$$P(A,B) + P(A,\bar{B}) = 1 \tag{7}$$

This formula is called the rule of the complement.

We can now demonstrate that the probability degree, for which we postulated
in II,2 only the nonnegative character, can never become greater than 1.
We can complement the term B by its negation to constitute a complete
disjunction. Considering the fact expressed in II,2 that both probabilities
occurring in (7) cannot be negative, we obtain from (7) the relation

$$0 \leq P(A,B) \leq 1 \tag{8}$$
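
The step from (7) to (8) can be spelled out line by line. The chain below merely restates the argument just given, on the assumption that A is not empty, so that II,2 applies to both terms of (7).

```latex
\begin{align*}
P(A,B) &\geq 0, \qquad P(A,\bar{B}) \geq 0 && \text{(II,2)}\\
P(A,B) &= 1 - P(A,\bar{B})                 && \text{(7)}\\
P(A,B) &\leq 1                             && \text{since } P(A,\bar{B}) \geq 0
\end{align*}
```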

Furthermore, we have from II,1 and (6) the theorem

$$(A \supset \bar{B}) \supset (\exists p)\,(A \underset{p}{\supset\!\!\!\!-} B).(p = 0) \tag{9}$$
The mathematical symbolization of the calculus of probability may be
illustrated by another problem. Given the three classes $B_1$, $B_2$, $B_3$, which are
mutually exclusive but do not form a complete disjunction, and given the
three probabilities

$$P(A, B_1 \lor B_2) \qquad P(A, B_2 \lor B_3) \qquad P(A, B_3 \lor B_1) \tag{10}$$

we wish to infer from them the existence of the three individual probabilities

$$P(A,B_1) \qquad P(A,B_2) \qquad P(A,B_3) \tag{11}$$

Theorem (2) is not applicable, because none of the individual probabilities
is known to exist. However, we obtain from the addition theorem the equations

$$\begin{aligned}
P(A,B_1) + P(A,B_2) &= P(A, B_1 \lor B_2)\\
P(A,B_2) + P(A,B_3) &= P(A, B_2 \lor B_3)\\
P(A,B_3) + P(A,B_1) &= P(A, B_3 \lor B_1)
\end{aligned} \tag{12}$$

They can be solved for the individual probabilities:

$$\begin{aligned}
P(A,B_1) &= \tfrac{1}{2}\,[P(A, B_1 \lor B_2) + P(A, B_3 \lor B_1) - P(A, B_2 \lor B_3)]\\
P(A,B_2) &= \tfrac{1}{2}\,[P(A, B_1 \lor B_2) + P(A, B_2 \lor B_3) - P(A, B_3 \lor B_1)]\\
P(A,B_3) &= \tfrac{1}{2}\,[P(A, B_3 \lor B_1) + P(A, B_2 \lor B_3) - P(A, B_1 \lor B_2)]
\end{aligned} \tag{13}$$

The three individual probabilities (11) are therefore determined according 
to (13) by the or-probabilities (10); and it follows from the rule of existence 
that when (10) is given, the existence of (11) is also assertable. Owing to the 
rule of existence, we can apply, in the calculus of probabilities, the procedure 
of eliminating unknown quantities from a system of equations and use it to 
find new existing probabilities. Probability equations, therefore, determine
existence, that is, the existence of any of the probabilities occurring in an
equation is secured if all the other probabilities are known to exist.¹
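
The elimination that leads from equations (12) to the solution (13) can be sketched numerically. The function name `individual_probabilities` and the sample values below are illustrative assumptions; exact fractions are used so that the check involves no rounding.

```python
from fractions import Fraction

def individual_probabilities(p12, p23, p31):
    """Solve equations (12) for P(A,B1), P(A,B2), P(A,B3) as in (13).

    p12, p23, p31 stand for the or-probabilities (10):
    P(A, B1 or B2), P(A, B2 or B3), P(A, B3 or B1).
    """
    half = Fraction(1, 2)
    p1 = half * (p12 + p31 - p23)
    p2 = half * (p12 + p23 - p31)
    p3 = half * (p31 + p23 - p12)
    return p1, p2, p3

# Hypothetical individual probabilities of three mutually exclusive classes
# that do not form a complete disjunction (their sum is less than 1).
b1, b2, b3 = Fraction(1, 6), Fraction(1, 3), Fraction(1, 4)

# The or-probabilities that the addition theorem assigns to them, as in (12).
p12, p23, p31 = b1 + b2, b2 + b3, b3 + b1

solved = individual_probabilities(p12, p23, p31)
print(solved == (b1, b2, b3))
```

Given only the three or-probabilities, the function recovers the individual values, which is the sense in which (10) determines (11).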


¹ I am indebted to E. Tornier for having called my attention to the fact that the problem
formulated in (10) and (11) cannot be solved by means of the formulas given in my paper on
probability published in Math. Zs., Vol. 34 (1932), p. 568. In that article I did not use the
existence rule, but gave special reversal axioms that permitted the derivation of such theorems
as (2) and, thereby, the application of the calculus of algebraic equations. But it turned
out that, in this system, the existence-determining character is not always conserved when
variables are eliminated. Equations (12) determine existence for my former system also, but
equations (13) do not have this property. This fact led me to replace the reversal axioms by
the rule of existence.

§ 14. The Theorem of Multiplication

The fourth and last group refers to an axiom that determines the probability
of a combination of terms. It is a well-known theorem of the probability
calculus that the probability of a combination, that is, the probability of a
logical product, is represented by the arithmetical product of certain individual
probabilities. This is the multiplication theorem of the probability
calculus. The theorem is formulated by the following axiom:

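IV. Theorem of multiplication

$$(A \underset{p}{\supset\!\!\!\!-} B).(A.B \underset{u}{\supset\!\!\!\!-} C) \supset (\exists w)\,(A \underset{w}{\supset\!\!\!\!-} B.C).(w = p \cdot u)$$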

For the first time we deal with probability expressions in which the probability
implication refers to three different classes, two of them occurring
either in the first or in the second term. This does not cause any difficulty,
because the translation rule (p. 49) determines the transition to the detailed
notation for formulas of this kind also. In this case the domain of the probability
implication is a triplet of sequences.

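By a procedure of the kind used for the theorem of addition we can derive
the converse of the multiplication theorem. We obtain two different conversions,
since the three events A, B, C do not occur symmetrically in IV, whereas
III is symmetrical with respect to B and C:

$$(A \underset{p}{\supset\!\!\!\!-} B).(A \underset{w}{\supset\!\!\!\!-} B.C) \supset (\exists u)\,(A.B \underset{u}{\supset\!\!\!\!-} C).\left(u = \frac{w}{p}\right) \tag{1}$$

$$(A.B \underset{u}{\supset\!\!\!\!-} C).(A \underset{w}{\supset\!\!\!\!-} B.C) \supset (\exists p)\,(A \underset{p}{\supset\!\!\!\!-} B).\left(p = \frac{w}{u}\right) \tag{2}$$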

The proof of the theorems is based on the rule of existence, which applies 
because it can be demonstrated that the probability implications occurring 
on the right in (1) and (2) are determined by those on the left. Because of 
theorems (1) and (2), axiom iv can be replaced by the more comprehensive 
formula, written in the P-notation, 

$$P(A,B.C) = P(A,B) \cdot P(A.B,C) \tag{3}$$

Theorems (1) and (2) mean that formula (3) can be solved according to the 
rules for mathematical equations for each of the individual probabilities 
occurring. Here again it is seen that the mathematical formalization of the 
probability calculus depends on the validity of the existence rule, as explained 
in § 13. 

Formula (3) is always true and does not require any restricting condition 
to be added verbally in the context, as was necessary for (3, § 13). Formula 
(3) will therefore be used in further discussion of the theorem of multiplication, 
without going back to axiom iv. The form selected here for theorem (3), 
characterized by the occurrence of three classes and of a term having two 
classes in the place of the reference class, has long been applied in the British 
and the American literature. 1 It has been used in the axiomatic construction 
in this work because only in this form is the axiom always correct. The 
probability from A to the logical product B.C can be calculated only if the 
probability from A to B as well as that from A.B to C is given. 

In mathematical presentations the probability P(A.B,C) is usually called 
“the relative probability of C with respect to B”. This notation does not 
seem advisable because all probabilities are relative, and, furthermore, because 
the probability under consideration cannot be characterized by B and C 
alone but requires class A also. 

For example, the probability that a person suffering from diphtheria subsequently 
contracts nephritis and dies is represented by a probability of the 
form P(A,B.C), A denoting diphtheria; B, nephritis; and C, death. The 
probability is calculated as the product of the probability that a person 
suffering from diphtheria contracts nephritis, and the probability that a person 
dies who gets nephritis after having had diphtheria. The latter probability 
is different from the one that a person suffering from nephritis will 
die, since a patient who has had diphtheria is weakened and therefore is in 
greater peril of losing his life. This consideration shows why the last probability 
occurring in (3) must be characterized by three classes. 

Another example is the probability that a thunderstorm follows a hot 
summer day with a subsequent change in the weather, which splits up into 
the product of two probabilities: the probability that a thunderstorm will 
follow a hot day and the probability that a change in the weather will follow 
a thunderstorm that was preceded by a hot day. The second probability is 
smaller than the probability that any thunderstorm brings with it a change 
in the weather, because the convective thunderstorms produced by local heat 
conditions usually do not result in a change in the weather, in contradistinction 
to frontal thunderstorms. The example illustrates once more the necessity 
of characterizing by three classes the probability that occurs in the last 
term of (3). 
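
The role of the three classes can be made concrete with a small calculation. The following is only a sketch: the counts are invented for the purpose of illustration and are not taken from any statistics, and the variable names are our own. It shows that formula (3) reproduces the directly counted value of P(A,B.C), whereas suppressing class A in the last factor generally does not.

```python
from fractions import Fraction

# Hypothetical counts, invented purely for illustration.
n_A = 1000    # patients with diphtheria (class A)
n_AB = 50     # of these, patients who contract nephritis (class A.B)
n_ABC = 20    # of these, patients who die (class A.B.C)

p_A_B = Fraction(n_AB, n_A)       # P(A,B)
p_AB_C = Fraction(n_ABC, n_AB)    # P(A.B,C): the reference class is A.B, not B
p_A_BC = p_A_B * p_AB_C           # P(A,B.C) according to formula (3)

# Hypothetical two-class value P(B,C), taken over all nephritis patients.
p_B_C = Fraction(1, 10)
wrong = p_A_B * p_B_C             # result obtained if class A is suppressed

print(p_A_BC, Fraction(n_ABC, n_A), wrong)
```

With these assumed numbers the product formed with P(A.B,C) agrees with the direct count, while the product formed with P(B,C) is smaller by a factor of four.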

It must be regarded as a special case if two classes suffice for this term, 
a case arising when the actual three-class probability is equal to a certain 
two-class probability. Such specialization results if 

$$P(A.B,C) = P(A,C) \tag{4}$$

Then (3) assumes the form of the special theorem of multiplication: 

$$P(A,B.C) = P(A,B) \cdot P(A,C) \tag{5}$$

1 In 1878 the form was used by C. S. Peirce. See his Collected Papers (Cambridge, Mass., 
1932), Vol. II, p. 415. J. M. Keynes also employed the form in A Treatise on Probability 
(London, 1921), chap. xi, p. 6. The use of relative probabilities for the determination of 
dependent events is, of course, much older. P. S. Laplace gives a corresponding rule in his Essai 
philosophique sur les probabilités (Paris, 1814), chapter on “Principes généraux, quatrième 
principe.” But he uses only two classes, my classes B and C, suppressing the general reference 
class A. 

The condition (4) is paraphrased by the statement: the events B and C are 
mutually independent with respect to A (see also § 23). For example, the probability 
that a sudden gust of wind will capsize two sailboats is obtained as 
the product of the probability that the wind overturns one boat by the corresponding 
probability concerning the other boat. The two probabilities need 
not be the same, since the two sailboats may be of different construction. 
It is, however, necessary for (5) that the probability of the second boat’s 
turning over be independent of whether the first boat turns over. 

Another specialization of (3) is obtained if A can be represented as the 
product of two events $A_1$ and $A_2$ such that 

$$P(A_1.A_2,B) = P(A_1,B) \qquad P(A_1.A_2.B,C) = P(A_2.B,C) \tag{6}$$

In this case (3) leads to 

$$P(A_1.A_2,B.C) = P(A_1,B) \cdot P(A_2.B,C) \tag{7}$$

If we add the specialization analogous to (4) 

$$P(A_2.B,C) = P(A_2,C) \tag{8}$$

we obtain 

$$P(A_1.A_2,B.C) = P(A_1,B) \cdot P(A_2,C) \tag{9}$$

This case may be illustrated by the throwing of two dice: $A_1$ refers to the 
throwing of one die and $A_2$ to the throwing of the other. However, (9) would 
not be permissible without the conditions (6) and (8). 

A third specialization results if 

$$P(A.B,C) = P(B,C) \tag{10}$$

Then (3) becomes 

$$P(A,B.C) = P(A,B) \cdot P(B,C) \tag{11}$$

Examples of this kind occur in certain causal chains: A may be represented 
by the occurrence of a storm; B, the falling of a tree; C, an accident caused 
by the falling tree. For the application of (11), however, we must inquire in 
each case whether (10) is satisfied. 

The preceding discussion reveals that specializations of the multiplication 
theorem, some of which are used as axioms in representations of the probability 
calculus, do not provide formulas that are always true. They result 
from the general form (3) only for special cases. The latter are characterized 
by the equality of certain probabilities having different reference classes, 
as stated in (4), (6), (8), (10). It follows that the question whether one of the 
special forms of the multiplication theorem can be applied is reduced to a 
question of the same type as that of how to determine the numerical value 
of a probability. It is always known whether two probabilities are equal 
when the probabilities themselves are known. Using the general form (3), 
or the form of axiom iv, for the theorem of multiplication eliminates certain 
logical difficulties that were connected with this theorem in the history of the 
calculus of probability. 

§ 15. Reduction of the Multiplication Theorem 
to a Weaker Axiom 

The theorem of multiplication is not independent of the other axioms; it 
can be reduced to a weaker assumption. In order to show this dependence 
I shall make use of the fact that the multiplication theorem can be split into 
two separate assertions. The first partial assertion states that the probability 
P(A,B.C) is determined by P(A,B) and by P(A.B,C); the second assertion 
is that P(A,B.C) is obtained, in particular, by the arithmetical multiplication 
of the two probabilities. The second assertion need not be stated explicitly 
as an axiom, but can be derived from the calculus with the use of the other 
axioms. 

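To prove this contention, multiplication theorem iv is replaced by the 
weaker axiom 

$$\text{IVa.} \quad (A \underset{p}{\mathrel{\supset\!\!\!\!\!-}} B).(A.B \underset{u}{\mathrel{\supset\!\!\!\!\!-}} C) \supset (\exists w)(A \underset{w}{\mathrel{\supset\!\!\!\!\!-}} B.C).[w = f(p,u)]$$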

Here f stands for a mathematical function, temporarily undefined, that is to 
determine for any values p, u the corresponding w and, conversely, is required 
to be solvable unambiguously for p and u. Similarly to (1 and 2, § 14), it 
can be shown that the probability implication written at the right in these 
theorems assumes the degree of probability corresponding to the solution of 
w = f(p,u) for p and u respectively; in these theorems the probability on 
the right side is replaced by 

$$p = f'(w,u) \quad \text{and} \quad u = f''(w,p), \text{ respectively,} \tag{1}$$

where f' and f'' represent the functions obtained by the solution. In this way 
it can be shown, analogously to (3, § 14), that we may write 

$$P(A,B.C) = f[P(A,B), P(A.B,C)] \tag{2}$$

The function f is the function occurring in iva, and the comma between the 
probability symbols separates the two arguments of this function; that is, 
it serves as the comma between the arguments of a mathematical function. 

In order to infer the form of f from (2), we substitute for C the disjunction 
of two mutually exclusive events C and D; then (2) becomes 

$$P(A,B.[C \lor D]) = f[P(A,B), P(A.B,C \lor D)] \tag{3}$$

According to the first distributive law (4a, § 4), we expand 

$$(B.[C \lor D] \equiv B.C \lor B.D) \tag{4}$$

and apply to both sides of equation (3) the addition theorem (3, § 13): 

$$P(A,B.[C \lor D]) = P(A,B.C \lor B.D) = P(A,B.C) + P(A,B.D) \tag{5a}$$

$$P(A.B,C \lor D) = P(A.B,C) + P(A.B,D) \tag{5b}$$

The probabilities of the logical products occurring in (5a) are expanded again 
according to (2): 

$$P(A,B.C) = f[P(A,B), P(A.B,C)] \qquad P(A,B.D) = f[P(A,B), P(A.B,D)] \tag{6}$$

Thus (3) is transformed into 

$$f[P(A,B), P(A.B,C)] + f[P(A,B), P(A.B,D)] = f[P(A,B), P(A.B,C) + P(A.B,D)] \tag{7}$$

Using the abbreviations 

$$P(A,B) = p \qquad P(A.B,C) = u \qquad P(A.B,D) = v \tag{8}$$

we can write (7) as 

$$f[p,u] + f[p,v] = f[p,u + v] \tag{9}$$

This is a functional equation for f; if it is to be valid for any values u and v 
the function f must have the form 

$$f[p,u] = g(p) \cdot u \tag{10}$$

where g(p) represents a function of p alone, which remains undetermined for 
the time being. 1 


In (2) we now substitute $[C \lor \bar{C}]$ for C; then (2) becomes 

$$P(A,B.[C \lor \bar{C}]) = f[P(A,B), P(A.B,C \lor \bar{C})] \tag{11}$$

According to (5c, § 4), we have 

$$(B.[C \lor \bar{C}] \equiv B) \tag{12}$$

and therefore 

$$P(A,B.[C \lor \bar{C}]) = P(A,B) = p \qquad P(A.B,C \lor \bar{C}) = 1 \tag{13}$$


1 I refer to a well-known theorem of mathematics. It may be proved as follows: we put 
u = 0; then we derive from (9) that f(p,0) = 0. Assuming v to be the differential increase du, 
we write (9): 

$$f[p,0 + du] - f[p,0] = f[p,u + du] - f[p,u]$$

Dividing by du, we obtain for the limit du → 0 the differential equation 

$$\left(\frac{\partial f[p,u]}{\partial u}\right)_0 = \left(\frac{\partial f[p,u]}{\partial u}\right)_u$$

The subscript marks the argument-place at which the differential quotient is to be formed. 
Since u can be chosen at random, the equation states the differential quotient for u to be 
constant; that is, the function f is linear with respect to u. It is even possible to drop the 
assumption that the function f is differentiable and continuous, but the proof will then be 
more complicated. 

Using these results in combination with (10), we transform (11) into 

$$p = f[p,1] = g(p) \cdot 1 = g(p) \tag{14}$$

With this determination of g(p), the relation (10) assumes the form 

$$f(p,u) = p \cdot u \tag{15}$$

Because of (2) and (8) this means 

$$P(A,B.C) = P(A,B) \cdot P(A.B,C) \tag{16}$$

Thus we have proved the multiplication theorem (3, § 14). 
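
The two conditions that single out the product form can be checked numerically. The sketch below is not part of the proof; it merely tests, on a grid of rational values, whether a candidate function satisfies the additivity in the second argument required by the functional equation (9) and the condition f(p,1) = p obtained from (13). The two rival functions and all names are hypothetical and serve only as contrast.

```python
from fractions import Fraction
from itertools import product

def is_additive_in_u(f, grid):
    """Check f(p,u) + f(p,v) == f(p,u+v) for all grid values with u + v <= 1."""
    return all(f(p, u) + f(p, v) == f(p, u + v)
               for p, u, v in product(grid, repeat=3) if u + v <= 1)

def is_normalized(f, grid):
    """Check f(p,1) == p, the condition that fixes g(p) = p."""
    return all(f(p, Fraction(1)) == p for p in grid)

grid = [Fraction(k, 10) for k in range(11)]

product_rule = lambda p, u: p * u       # the form (15)
square_in_u = lambda p, u: p * u * u    # hypothetical rival, not linear in u
square_in_p = lambda p, u: p * p * u    # hypothetical rival, linear in u but g(p) = p*p

candidates = {"p*u": product_rule, "p*u*u": square_in_u, "p*p*u": square_in_p}
for name, f in candidates.items():
    print(name, is_additive_in_u(f, grid), is_normalized(f, grid))
```

Of the three candidates only the arithmetical product passes both checks; the first rival violates the additivity, the second the condition f(p,1) = p.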

It is seen from this demonstration that the theorem of multiplication represents 
a necessary formula within the frame of the calculus of probability. 
That the probability of the logical product is given by an arithmetical product 
is a consequence of the fact that the probability of a logical sum is given by 
an arithmetical sum, in combination with the first distributive law of logic. 

The result enables us to introduce a new definition of the property of independence, 
defined in (4, § 14) or (5, § 14). Combining (4, § 14) with (2), we 
may define independence as follows. 2 Two events are independent with 
respect to A if the probability from A to their logical product is a function 
of their individual probabilities with respect to A alone, that is, if 

$$P(A,B.C) = f[P(A,B), P(A,C)] \tag{17}$$

It then follows that f assumes the form of the arithmetical product. This 
characterization of independence is very instructive; it states that the probability 
of the combination of independent events is determined whenever 
the probabilities of the separate events are given. For instance, the probability 
1/6 for each of two dice determines the probability for the combination of 
any two faces. 

§ 16. The Frequency Interpretation 

Axioms i to iv suffice to derive all the theorems of the calculus in which 
probability sequences occur as wholes the structure of which is not considered. 
The totality of these theorems is called the elementary calculus of probability. 
With the given axioms we therefore control the formal structure of the elementary 
calculus of probability. But before developing the theorems of this 
calculus we wish to give the probability concept an interpretation over and 
above the characterization of its formal structure (see § 8). 

This leads to a problem that has been under much discussion. The formal 
structure of the probability calculus that I have developed might be conceded 
by adherents of the most diverse theories about probability. But the question 
of the interpretation of the probability concept can be answered only on the 
basis of painstaking philosophical investigations, and different theories have 
answered it in different ways. It will be treated, therefore, in more detail 
later (see chap. 9). 

2 I am indebted to Kurt Grelling for the suggestion that independence can be characterized 
in this manner; he thereby directed my attention to the foregoing proof for the product 
form of the function f. 

The laws of the calculus of probability are difficult to understand, however, 
if one does not envisage a definite interpretation. Thus, for didactic reasons, 
an interpretation of the probability concept must be added, at this point, to 
the axiomatic construction. But this method will not prejudice later investigations 
of the problem. The interpretation is employed merely as a means 
of illustrating the system of formal laws of the probability concept, and it 
will always be possible to separate the conceptual system from the interpretation, 
because, for the derivation of theorems, the axioms will be used in the 
sense of merely formal statements, without reference to the interpretation. 

This presentation follows a method applied in the teaching of geometry, 
where the conceptual formulation of geometrical axioms is always accompanied 
by spatial imagery. Although logical precision requires that the premises 
of the inferences be restricted to the meaning given in the conceptual 
formulation, the interpretation is used as a parallel meaning in order to make 
the conceptual part easier to understand. The method of teaching thus 
follows the historical path of the development of geometry, since, historically 
speaking, the separation of the conceptual system of geometry from its interpretation 
is a later discovery. The history of the calculus of probability has 
followed a similar path. The mathematicians who developed the laws of this 
calculus in the seventeenth and eighteenth centuries always had in mind an 
interpretation of probability, usually the frequency interpretation, though it 
was sometimes accompanied by other interpretations. 

In order to develop the frequency interpretation, we define probability as 
the limit of a frequency within an infinite sequence. The definition follows a 
path that was pointed out by S. D. Poisson 1 in 1837. In 1854 it was used 
by George Boole, 2 and in recent times it was brought to the fore by Richard 
von Mises, 3 who defended it successfully against critical objections. 

1 Recherches sur la probabilité des jugements en matière criminelle et en matière civile . . . 
(Paris, 1837). 

2 The Laws of Thought (London, 1854), p. 295. 

3 “Grundlagen der Wahrscheinlichkeitsrechnung,” in Math. Zs., Vol. V (1919), p. 52, and 
later publications. 

The following notation will be used for the formulation of the frequency 
interpretation. In order to secure sufficient generality for the definition, we 
shall not yet assume that all elements $x_i$ of the sequence belong to the class A. 
We assume, therefore, that the sequence is interspersed with elements of 
a different kind. For instance, the sequence of throws of a coin may be interspersed 
with throws of a second coin. In this case only certain elements $x_i$ 
will belong to the class A, if the class is defined as representing the throws of 
one of the coins only. Similarly, only some among the elements $y_i$ will belong 
to the class B, which may signify the occurrence of tails lying up. It may 
happen that $y_i$ represents a case of tails up, whereas the corresponding $x_i$ 
does not belong to the class A, that is, the event of tails lying up is produced 
by the second coin. When the frequency is counted out in such a sequence 
pair, the result is expressed by the symbol 


$$\underset{i=1}{\overset{n}{N}}\,(x_i \in A) \tag{1a}$$

which means the number of such $x_i$ between 1 and n that satisfy $x_i \in A$. 
The symbol is extended correspondingly to apply to different variables and 
to different classes and also to a pair, a triplet, and so on, of variables. For 
instance, the expression 

$$\underset{i=1}{\overset{n}{N}}\,(x_i \in A).(y_i \in B) \tag{1b}$$

represents the number of pairs $x_i, y_i$ such that $x_i$ belongs to A and simultaneously 
$y_i$ belongs to B; it signifies the number of pairs $x_i, y_i$ that are elements 
of the common class A and B. To abbreviate the notation, the following 
symbol is introduced: 

$$N^n(A) =_{Df} \underset{i=1}{\overset{n}{N}}\,(x_i \in A) \qquad N^n(A.B) =_{Df} \underset{i=1}{\overset{n}{N}}\,(x_i \in A).(y_i \in B) \tag{2}$$


Furthermore, the relative frequency $F^n(A,B)$ is defined by 

$$F^n(A,B) = \frac{N^n(A.B)}{N^n(A)} \tag{3}$$

In the special case in which all elements $x_i$ belong to the class A, that is, 
when the sequence $x_i$ is compact, the denominator of the fraction is equal to n, 
whereas in the numerator the expression A may be dropped; then (3) assumes 
the simpler form 

$$F^n(A,B) = \frac{1}{n} \cdot N^n(B) \tag{4}$$
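
The counting procedure may be illustrated by a short computation. The following is a sketch under assumptions of our own: the two membership sequences are given as lists of truth values, and the periodic patterns chosen for them are purely hypothetical. The function computes the relative frequency defined in (3) for the first n elements of a sequence pair that is not compact.

```python
from fractions import Fraction

def relative_frequency(xs_in_A, ys_in_B, n):
    """F^n(A,B) = N^n(A.B) / N^n(A) for the first n elements; None if N^n(A) = 0."""
    n_A = sum(1 for i in range(n) if xs_in_A[i])
    n_AB = sum(1 for i in range(n) if xs_in_A[i] and ys_in_B[i])
    return Fraction(n_AB, n_A) if n_A else None

n = 600
# Hypothetical patterns: even positions are throws of the first coin (class A),
# and every third throw shows tails lying up (class B).
xs_in_A = [i % 2 == 0 for i in range(n)]
ys_in_B = [i % 3 == 0 for i in range(n)]

print(relative_frequency(xs_in_A, ys_in_B, 7))      # fluctuates for small n
print(relative_frequency(xs_in_A, ys_in_B, 600))
print(relative_frequency([True] * n, ys_in_B, n))   # compact case, form (4)
```

For a compact sequence the denominator reduces to n, in agreement with (4); when no element belongs to A the fraction is undefined, which the sketch reports as None.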


With the help of the concept of relative frequency, the frequency interpretation 
of the concept of probability may be formulated: 

If for a sequence pair $x_i y_i$ the relative frequency $F^n(A,B)$ goes toward a limit 
p for n → ∞, the limit p is called the probability from A to B within the sequence 
pair. In other words, the following coordinative definition is introduced: 

$$P(A,B) = \lim_{n \to \infty} F^n(A,B) \tag{5}$$

No further statement is required concerning the properties of probability 
sequences. In particular, randomness (see § 30) need not be postulated. 

§ 17. The Origin of Probability Statements 

So long as we regard the probability calculus as a formal calculus by means 
of which formulas are manipulated, that is, so long as we do not speak of the 
meaning of the formulas, the origin of probability statements presents no 
problem. The question whether the individual probability statement is true 
or false, then, is not a problem of the calculus, as was explained above. The 
calculus deals solely with transformations of probability statements; and the 
statements of the mathematical calculus, therefore, represent exclusively 
tautological implications of the type, “If certain probability implications 
$a_1 \ldots a_n$ exist, then certain other probability implications $b_1 \ldots b_n$ exist 
also”. I agree here with a conception emphasized by von Mises. 

But it would be a shortsighted attitude if mathematicians were induced by 
this conception to regard the question of the origin of probability statements 
as unreasonable. With the given definition of the probability calculus, the 
question is merely shifted to another field. At the very moment at which an 
interpretation is assigned to the probability statement, there arises the question 
how to know whether, in a given instance, a probability statement holds. 
It follows from the nature of the interpretation that the question is equivalent 
to the question how to ascertain the existence of a limit of an infinite sequence. 

Here an important distinction must be made. First, probability sequences 
may be regarded as mathematically given sequences, that is, as sequences 
that are defined by a rule. For instance, a probability sequence can be defined 
by means of an infinite decimal fraction in which every even number is 
regarded as the case B and every odd number as the case $\bar{B}$. Whether such a 
sequence has a frequency limit and what the limit is, is a question of purely 
mathematical nature to be answered by means of the usual mathematical 
methods. It is important that we have at our disposal such mathematically 
given sequences representing the frequency interpretation; on occasion they 
will be used as models (see §§ 30 and 66). In the practical application of the 
probability calculus, however, they do not play a part. 
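
A sequence of this kind is easily produced by computation. As a sketch, and with a choice of our own, we take the decimal fraction of 1/7, whose digits 1, 4, 2, 8, 5, 7 repeat periodically, regard every even digit as a case of B, and count the relative frequency of B among the first n digits. The helper names are hypothetical.

```python
from fractions import Fraction

def decimal_digits(numerator, denominator, n):
    """First n digits after the decimal point of numerator/denominator (long division)."""
    digits = []
    remainder = numerator % denominator
    for _ in range(n):
        remainder *= 10
        digits.append(remainder // denominator)
        remainder %= denominator
    return digits

def frequency_of_even(digits):
    """Relative frequency of the case B (even digit) in a compact digit sequence."""
    return Fraction(sum(1 for d in digits if d % 2 == 0), len(digits))

for n in (6, 7, 600):
    print(n, frequency_of_even(decimal_digits(1, 7, n)))
```

Since each period of six digits contains exactly three even digits, the frequency equals 1/2 at every multiple of six and deviates from it only slightly in between; here the defining rule itself settles the question of the limit.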

Second, sequences provided by events in nature may be considered. For 
such sequences, which include all practical applications of the calculus of 
probability, we do not know a definite law regarding the succession of their 
elements. Instead of a defining rule, we have a finite initial section of the 
sequence; therefore we cannot know, strictly speaking, toward what limit 
such a sequence will proceed. We assume, however, that the observed frequency 
will persist, within certain limits of exactness, for the infinite rest of 
the sequence. This inference, which is called inductive inference, leads to very 
difficult logical problems; and it will be one of the most important problems 
of this investigation to find a satisfactory explanation of the inference. For 
the present, however, the inference will not be questioned. Suffice it to say 
that the inference is actually used, sometimes under the name a posteriori 
determination of a probability, by statisticians as well as in everyday life. We 
shall therefore use it, too, in problems of the application of the formulas 
constructed. 

It may sometimes be expedient, for mathematical reasons, to imagine a 
fictitious observer who can count out an infinite sequence and thus is able to 
determine its limit. But the picture serves only to illustrate certain logical 
relations and cannot replace the inductive inference where physical reality 
is concerned. 

To summarize: for the present we shall regard as verifiable an assertion 
stating that there exists a probability sequence of a determinate degree of 
probability. The verification may be derived either mathematically, from the 
defining rule of the sequence, or by means of an inductive inference. 

The given interpretation will now be used to elucidate some properties of 
the axiom system that so far, perhaps, have not been made sufficiently clear. 
First, we realize why the existence of an indeterminate probability implication 
has been regarded as a synthetic statement requiring empirical proof. 
The assertion that there exists a limit of the frequency, even without specification 
of the degree, represents a definite statement that is certainly not 
satisfied for every sequence pair $x_i y_i$. For this reason the rule of existence is 
necessary within our formal system; when interpreted, it expresses the assertion 
that a limit of the frequency exists in the cases concerned. 

Second, we recognize that the indeterminate probability implication 
$(A \mathrel{\supset\!\!\!\!\!-} B)$ states more than the existence of a mere possibility relation, which 
we write as $\overline{(A \supset \bar{B})}$. 1 The added meaning consists in the fact that the first 
statement asserts a certain regularity in the repetition of events. When a die 
is thrown upon a table, it is possible that a sudden thunderbolt may happen 
simultaneously; but such a statement of possibility does not mean that a 
probability implication exists between the two events. I do not wish to say 
that the probability is very small; I mean, rather, that it is not permissible 
to assert a definite regularity with respect to the occurrence of thunder when 
the die is thrown repeatedly. The illustration will make it clear that the 
existence of a probability cannot be inferred from the possibility of an event. 
But neither does the converse hold. From (1, § 12) it is seen that the possibility 
of an event cannot be inferred from the existence of a probability. 
The probability can be equal to zero, and the probability zero may or may 
not represent impossibility. In neither direction does an implication hold 
between the two statements $(A \mathrel{\supset\!\!\!\!\!-} B)$ and $\overline{(A \supset \bar{B})}$. Probability and possibility 
are disparate concepts, that is, their extensions overlap. 

If we were to assert that a frequency limit must exist for any two repetitive 
events observed for a sufficiently long time, we would commit ourselves to a 
far-reaching hypothesis. On this assumption it would be possible to drop the 
existence rule; but, instead, we should have to introduce into the calculus 
an axiom of the form, “For all A and C, $(A \mathrel{\supset\!\!\!\!\!-} C)$ is valid”. Obviously this 
addition would mean an extraordinary extension of the content of the calculus, 
with which we do not wish to burden the axiom system. 

1 This is the extensional possibility of § 80. 

I therefore consider the assertion of a determinate as well as of an indeterminate 
probability implication to be a synthetic statement, the validity 
of which can be ascertained, when physical events are concerned, by means 
of statistics in combination with inductive inferences. This method of ascertainment 
will not be questioned throughout the mathematical part of the 
investigation, because the frequency interpretation does not enter into the 
content of the probability calculus to be developed. It constitutes only an 
illustrative addition and will not be used for the derivation of theorems. 

§ 18. Derivation of the Axioms from the 
Frequency Interpretation 

It will now be shown that all axioms of the calculus of probability can be 
derived from the frequency interpretation, that is, they are tautologies if the 
frequency definition of probability is assumed. 

We start with the univocality axiom i. The case $(\bar{A})$, to which this axiom 
refers, signifies that the relative frequency $F^n$ assumes the indeterminate 
form 0/0, since the summation $N^n$ in (3, § 16) leads to 0 for numerator as well 
as denominator. Therefore we also have P(A,B) = 0/0, that is, the probability 
does not possess a determinate value. This result represents one assertion of 
the axiom. If the case $(\bar{A})$ does not hold, however, a definite limit exists; 
since there can be only one limit, the other assertion of the axiom is likewise 
satisfied. Notice that a limit exists even when only a finite number of elements 
$x_i$ belong to A; the value of the frequency for the last element is then regarded 
as the limit. This trivial case is included in the interpretation and does not 
create any difficulty in the fulfillment of this or the following axioms. 

Axiom ii,1 concerns the case in which each element of the form $(x_i \in A)$ 
is followed by an element $(y_i \in B)$, since this is what the logical implication 
asserts. In this case all $F^n = 1$, a result following immediately from (3, § 16), 
so that ii,1 is satisfied. The major implication in the axiom can be directed 
toward only one side, since the probability 1 can be obtained, also, if there 
are some cases in which $x_i \in A$ is followed by $y_i \in \bar{B}$. These cases, however, 
must be distributed so sparsely that the limit of $F^n$ becomes equal to 1, though 
every individual $F^n$ may be smaller than 1. An example is given by a compact 
sequence A accompanied by a sequence B that has a $\bar{B}$ in all elements whose 
subscript i is the square of a whole number but which has a B in all other 
elements. Thus the frequency interpretation makes it clear why the probability 
1 represents a wider concept than the logical implication. 

This consideration shows also that the probability implication of the degree 
p represents a generalization of the general implication of symbolic logic. 
Whereas the general implication demands all elements $x_i \in A$ to be followed 
by a $y_i \in B$, the probability implication includes the case in which certain 
$x_i \in A$ are followed by a $y_i \in \bar{B}$, with the qualification, however, that between 
the numbers of the elements there must exist a frequency ratio that goes in 
the limit toward a determinate value. The probability implication, itself 
representing a general implication, therefore constitutes the generalization 
of the usual general implication for sequences in which the individual implication 
occurs only in a certain number of places. Instead of demanding the 
individual implication to be valid without exceptions, we require only a 
frequency ratio. 
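
The example of a compact sequence in which B fails exactly at the square numbers can be followed numerically. The sketch below, with a function name of our own, uses the fact that among the subscripts 1 to n there are as many exceptions as there are whole numbers whose square does not exceed n.

```python
from fractions import Fraction
from math import isqrt

def frequency_with_exceptions_at_squares(n):
    """F^n(A,B) for a compact sequence in which B fails exactly at i = 1, 4, 9, ... <= n."""
    exceptions = isqrt(n)                # number of perfect squares among 1..n
    return Fraction(n - exceptions, n)

for n in (100, 10**6, 10**12):
    print(n, frequency_with_exceptions_at_squares(n))
```

Every individual frequency is smaller than 1, yet the deficit shrinks as n grows, so that the limit is 1 although the logical implication does not hold.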

That ii,2 is satisfied follows directly from the fact that the relative frequency 
$F^n$ is a positive number (including 0). The condition, expressed in 
(8, § 13), that the probability degree cannot be greater than 1 likewise follows 
from the definition of the relative frequency. 

We turn now to the addition theorem iii. In order to prove this axiom, we 
form first 

$$F^n(A,B \lor C) = \frac{N^n(A.[B \lor C])}{N^n(A)} \tag{1a}$$

If $(A.B \supset \bar{C})$ is valid, this is equal to 

$$\frac{N^n(A.B)}{N^n(A)} + \frac{N^n(A.C)}{N^n(A)} \tag{1b}$$

and we obtain 

$$F^n(A,B \lor C) = F^n(A,B) + F^n(A,C) \tag{2}$$

The equation remains unchanged in the transition to the limit, and for 
mutually exclusive events we have 

$$P(A,B \lor C) = P(A,B) + P(A,C) \tag{3}$$

The exclusion condition suffices for the addition of probabilities having the 
same first term. We need not presuppose, in such a case, that the terms 
B and C belong to the same sequence; this represents a special case for which, 
of course, the theorem is also valid. 

The given proof can be made clearer by the following consideration. We 
write the three sequences below one another, each in one row; however, we 
do not write the elements $x_i$, $y_i$, $z_i$, but only the classes A, B, C, to which the 
elements belong. For the sake of simplicity we shall assume that the sequence 
$x_i$ consists only of the elements $x_i \in A$ and thus is compact. We thereby arrive 
at the following arrangement: 

AAAAAAAA . . . 

BBBBBBBB . . . 

CCCCCCCC . . . 

(4) 

The frequency $F^n(A,B \lor C)$ expresses the relative frequency of the A under 
which a B or a C is found. Because of the condition of exclusion, a B and a C 
can never stand simultaneously under the same A, and thus the relative 
frequencies of B and C add up to that of B ∨ C. 
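
The counting argument can be checked on a short sequence. The following sketch uses hypothetical periodic patterns of our own choosing for the rows of such an arrangement; a fourth class D, which is not exclusive of B, is added only for contrast.

```python
from fractions import Fraction

def rel_freq(in_ref, in_attr):
    """F^n(ref, attr): share of the positions in the reference class that also show attr."""
    hits = [b for a, b in zip(in_ref, in_attr) if a]
    return Fraction(sum(hits), len(hits))

n = 12
in_A = [True] * n                         # compact sequence: every element is an A
in_B = [i % 3 == 0 for i in range(n)]
in_C = [i % 3 == 1 for i in range(n)]     # never stands under the same A as B
in_B_or_C = [b or c for b, c in zip(in_B, in_C)]

in_D = [i % 2 == 0 for i in range(n)]     # overlaps with B at i = 0 and i = 6
in_B_or_D = [b or d for b, d in zip(in_B, in_D)]

print(rel_freq(in_A, in_B_or_C), rel_freq(in_A, in_B) + rel_freq(in_A, in_C))
print(rel_freq(in_A, in_B_or_D), rel_freq(in_A, in_B) + rel_freq(in_A, in_D))
```

For the exclusive classes B and C the two values coincide; for B and D the sum of the frequencies exceeds the frequency of the disjunction, because the common positions are counted twice.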

The multiplication theorem iv, also, can be derived from the frequency 
interpretation. We obtain from (3, § 16) 

$$F^n(A,B.C) = \frac{N^n(A.B.C)}{N^n(A)} = \frac{N^n(A.B)}{N^n(A)} \cdot \frac{N^n(A.B.C)}{N^n(A.B)} = F^n(A,B) \cdot F^n(A.B,C) \tag{5}$$

The equation remains valid for the transition to the limit, if the individual 
limits exist, and we have with the use of (5, § 16) 

$$P(A,B.C) = P(A,B) \cdot P(A.B,C) \tag{6}$$

We thus arrive at the general theorem of multiplication (3, § 14). We now 
see why this form, which we used for the theorem, is always valid. Only in 
this form does the multiplication theorem represent a tautology in the frequency 
interpretation. 

This proof, too, may be illustrated by a schema as used above: 


A A A A AAA A . . . 

BBBBBBBB . . . (7) 

QC C QC QC Q . . . 

The frequency $F^n(A, B.C)$ represents the frequency of the couples $B.C$; the
first of the expressions standing on the right side of (5), $F^n(A,B)$, counts
the frequency of $B$. Now $B$ selects from the sequence of (7) a subsequence,
the elements of which are marked by a lower double bar in (7); this
subsequence, of course, contains elements $C$ as well as $\bar{C}$. The number of
elements of this subsequence is given by $N^n(A.B)$; therefore $F^n(A.B,C)$ means
the relative frequency of $C$ in the subsequence. The consideration is always
applicable: if a term is added before the comma within a probability
expression, the frequency is counted within the subsequence that is selected
by this term. Formula (5) states that the desired frequency of the pair $B.C$
can be represented as the product of the frequency of $B$ by the frequency of
$C$ counted within the subsequence selected by $B$.
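
The counting argument behind (5) can be checked on a short finite sequence. The following is only a minimal sketch: the six-element sequence and the helper names `count` and `frequency` are assumptions introduced for illustration, not part of the text.

```python
from fractions import Fraction

# Hypothetical finite sequence: each element records whether A, B, C hold.
SEQUENCE = [
    (True, True, True),
    (True, False, True),
    (True, True, False),
    (True, True, True),
    (True, False, False),
    (False, True, True),
]


def count(seq, predicate):
    """N^n(...): number of elements satisfying the predicate."""
    return sum(1 for element in seq if predicate(element))


def frequency(seq, first, second):
    """F^n(first, second) = N^n(first.second) / N^n(first)."""
    both = count(seq, lambda e: first(e) and second(e))
    return Fraction(both, count(seq, first))


def is_a(e):
    return e[0]


def is_b(e):
    return e[1]


def is_c(e):
    return e[2]


def is_ab(e):
    return e[0] and e[1]


def is_bc(e):
    return e[1] and e[2]


# Left and right sides of (5): F^n(A, B.C) and F^n(A,B) * F^n(A.B,C).
lhs = frequency(SEQUENCE, is_a, is_bc)
rhs = frequency(SEQUENCE, is_a, is_b) * frequency(SEQUENCE, is_ab, is_c)
print(lhs, rhs)
```

Exact fractions are used so that the equality holds strictly rather than up to rounding, in line with the remark that the relation is a tautology before any transition to the limit.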

These considerations lead to an instructive interpretation of the
independence relation defined in (4, § 14). The definition

$$P(A.B, C) = P(A,C) \tag{8}$$

states that, within the subsequence selected by $B$ from the $C$-sequence, $C$ has
the same relative frequency as in the main sequence. This characterization
reveals the meaning of the independence relation: that $B$ does not influence $C$
means that a selection by $B$ from the $C$-sequence does not change the relative
frequency.¹ For instance, when we throw two dice and consider, within the
sequence produced by the second die, only the subsequence of throws in which
the first die simultaneously shows face 6, we shall find, too, the relative
frequency 1/6 for any face of the second die.
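
The dice example can be sketched with an idealized finite sequence in which each of the 36 pairs of faces occurs exactly once. This idealization, and the names `throws`, `selected`, and `freq_second_die`, are assumptions made for the illustration.

```python
from fractions import Fraction
from itertools import product

# Idealized sequence of throws: (face of first die, face of second die).
throws = list(product(range(1, 7), repeat=2))


def freq_second_die(face, sequence):
    """Relative frequency with which the second die shows `face`."""
    hits = sum(1 for _, second in sequence if second == face)
    return Fraction(hits, len(sequence))


# Subsequence selected by "first die shows face 6".
selected = [throw for throw in throws if throw[0] == 6]

main_freq = {face: freq_second_die(face, throws) for face in range(1, 7)}
sub_freq = {face: freq_second_die(face, selected) for face in range(1, 7)}
print(main_freq == sub_freq)
```

The selection leaves every relative frequency of the second die unchanged, which is what relation (8) expresses.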

Finally, it remains to prove that the rule of existence is derivable from the
frequency interpretation. Since each of the axioms represents a tautological
relation between frequencies, which holds strictly even before the transition
to the limit, every probability formula derivable from the axioms will
correspond also to a tautological relation between frequencies; and this
relation will be strictly valid before the transition to the limit. Every such
relation can be written in the form

$$f_m^n = r(f_1^n \ldots f_{m-1}^n) \tag{9}$$

In this formula the $f_i^n$ stand for frequency expressions of the form

$$f_i^n = F^n(A_i, B_i) \tag{10}$$

The subscripts in (9) and (10) indicate the fact that we are dealing here with
frequency quantities that belong to different events $A$, $B$ . . . . According to
the existence rule, $r$ is a single-valued function, free from singularities at
this place. Passing to the limit $n \to \infty$, we derive from the laws governing
the formation of a limit that, whenever the $f_1^n \ldots f_{m-1}^n$ go toward limits
$p_1 \ldots p_{m-1}$, the $f_m^n$ also must approach a limit $p_m$. In other words, the
probability $p_m$ must exist whenever the probabilities $p_1 \ldots p_{m-1}$ exist. This
is the assertion made by the rule of existence.

At the same time we recognize why the existence of a probability is bound
by the condition that it be determined by given probabilities. Assume that
it is unknown in (9) for two quantities, say, $f_m^n$ and $f_{m-1}^n$, whether they go
toward a limit. Then we cannot infer, from the fact that the other quantities
$f_1^n \ldots f_{m-2}^n$ approach certain limits, that the two residual quantities $f_m^n$
and $f_{m-1}^n$ go toward a limit. For instance, if the probability of a logical sum
is given, the sum $f_3^n$ of the two frequencies

$$f_1^n + f_2^n = f_3^n$$

approaches a limit $p_3$. Yet the individual frequencies $f_1^n$ and $f_2^n$ need not go
toward a limit. A convergence can be inferred only when it is known that,
apart from $f_3^n$, at least one of the other quantities, say $f_1^n$, approaches a limit.
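
A constructed example may make this concrete. In the sketch below, which is an illustration under assumptions and not taken from the text, the events $B$ and $C$ are exclusive and occur in alternating blocks of doubling length; the function names and the block construction are hypothetical.

```python
from fractions import Fraction


def block_sequence(num_blocks):
    """Blocks of length 1, 2, 4, ...; even-numbered blocks hold B, odd ones C."""
    seq = []
    for k in range(num_blocks):
        seq.extend(["B" if k % 2 == 0 else "C"] * 2**k)
    return seq


def frequency(seq, label, n):
    """Relative frequency of `label` among the first n elements."""
    return Fraction(seq[:n].count(label), n)


seq = block_sequence(8)                # 255 elements, all belonging to A
checkpoints = [15, 31, 63, 127, 255]   # ends of blocks
f1 = [frequency(seq, "B", n) for n in checkpoints]
f2 = [frequency(seq, "C", n) for n in checkpoints]
f3 = [x + y for x, y in zip(f1, f2)]
print(f1, f3)
```

At these checkpoints the frequency of the disjunction is constantly 1, while the frequency of $B$ alternates between 1/3 and values near 2/3, so that it shows no tendency toward a single limit in this construction.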

This concludes the proof that all the axioms of the probability calculus
follow logically from the frequency interpretation. The result holds not only
for infinite but also for finite sequences, provided that in this case we regard
the limit of the frequency as given by the value of $F^n(A,B)$ taken for the last
element. All the axioms are satisfied tautologically, and are strictly, not only
approximately, valid even before the transition to the limit.

¹ R. von Mises has made this idea the starting point of his probability theory. See § 30.

The given proof guarantees that the frequency interpretation is an admissible
interpretation of the theorems derivable from the axiom system. The
interpretation will be applied in the examples used to illustrate the derived
formulas.


§ 19. The Rule of Elimination 

We may now proceed to the derivation of individual theorems of the
probability calculus from the axiom system.

Fig. 4. Schema for rule of elimination, according to (2).


Many practical cases present the problem of calculating the probability
from $A$ to $C$, when $C$ is linked to $A$ by an intermediate term $B$ and only the
intermediary probabilities are given. Figure 4 may serve to illustrate the
problem.

It represents the divergent probabilities $P(A,B)$ and $P(A,\bar{B})$, having the
first term in common, and the convergent probabilities $P(A.B,C)$ and
$P(A.\bar{B},C)$, which possess a common term after the comma. When the divergent
and convergent probabilities are given, it is possible to calculate $P(A,C)$.
For this purpose we use the logical equivalence

$$([B \lor \bar{B}].C \equiv C) \tag{1}$$

and thus obtain the relations

$$P(A,C) = P(A, [B \lor \bar{B}].C) = P(A, B.C \lor \bar{B}.C) = P(A, B.C) + P(A, \bar{B}.C)$$

In the last equality the addition theorem has been applied because the terms
are mutually exclusive. The use of the multiplication theorem gives the result

$$P(A,C) = P(A,B) \cdot P(A.B,C) + P(A,\bar{B}) \cdot P(A.\bar{B},C) \tag{2}$$
This formula is called the rule of elimination. It permits the elimination of
a term $B$ that is interpolated between the terms $A$ and $C$, and the
establishment of a direct probability from $A$ to $C$. The rule of elimination
performs with respect to probability implication the function that is performed
for the logical implication by its transitivity (8 1, § 4). But here the logical
structure is much more complicated than it is for a transitivity. The
elimination of $B$ can be achieved, according to (2), only when $P(A.\bar{B},C)$ is
known, apart from $P(A,B)$ and $P(A.B,C)$. The probability $P(A,\bar{B})$ is determined
by $1 - P(A,B)$, but $P(A.\bar{B},C)$ represents an independent probability that is
not determined by the other quantities written at the right of (2). The
convergent probabilities $P(A.B,C)$ and $P(A.\bar{B},C)$ will be called nonbound
probabilities, since their sum can be greater or smaller than 1; the divergent
probabilities $P(A,B)$ and $P(A,\bar{B})$ are bound probabilities, that is, they must
add up to the value 1.

The theorem may be illustrated by an example previously used. Let $A$
denote a hot summer day; $B$, the occurrence of a thunderstorm; $C$, a change
in the weather. The probability of a change in weather occurring on a hot
day can be calculated from the intermediary probability concerning the
thunderstorm; but we must know the probability of the occurrence of a
thunderstorm, the probability of a change in the weather on a hot day after
a thunderstorm has occurred, and the probability of a change in the weather
on a hot day on which no thunderstorm occurs.
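
The weather example can be put into numbers with a short sketch of the rule of elimination (2). The three input values below are invented for illustration only; the text gives no figures, and the function name `rule_of_elimination` is likewise an assumption.

```python
from fractions import Fraction


def rule_of_elimination(p_b, p_c_given_b, p_c_given_not_b):
    """P(A,C) = P(A,B) * P(A.B,C) + P(A,not-B) * P(A.not-B,C)."""
    p_not_b = 1 - p_b  # bound probability: P(A,B) + P(A,not-B) = 1
    return p_b * p_c_given_b + p_not_b * p_c_given_not_b


# Hypothetical values for a hot summer day (A):
p_storm = Fraction(3, 10)                   # P(A,B): thunderstorm occurs
p_change_after_storm = Fraction(4, 5)       # P(A.B,C)
p_change_without_storm = Fraction(1, 10)    # P(A.not-B,C)

p_change = rule_of_elimination(p_storm, p_change_after_storm, p_change_without_storm)
print(p_change)
```

All three inputs are needed; in particular the third, nonbound probability cannot be inferred from the first two.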

In the frequency interpretation, (2) can easily be made clear: the number
of $C$'s to which a $B$ is coordinated, and the number of $C$'s to which a $\bar{B}$ is
coordinated, add up to the total number of $C$'s.

The rule of elimination contains some interesting special cases. First, we
may have

$$P(A.B,C) = P(B,C) \tag{3a}$$
$$P(A.\bar{B},C) = P(\bar{B},C) \tag{3b}$$

Then (2) assumes the form

$$P(A,C) = P(A,B) \cdot P(B,C) + P(A,\bar{B}) \cdot P(\bar{B},C) \tag{4}$$

We can illustrate this form by choosing for $B$ and $\bar{B}$ two bowls that contain
black and white balls in different ratios, and for $A$ another bowl containing,
say, numerous tickets on which is written $B$ or $\bar{B}$. The ticket drawn from $A$
decides whether the second draw should be made from $B$ or $\bar{B}$. By $C$ we
understand the event of a black ball being obtained.

A further specialization results for

$$P(A.\bar{B},C) = 0 \tag{5}$$

We then have

$$P(A,C) = P(A,B) \cdot P(A.B,C) \tag{6}$$

If the specialization (3a) is added, we obtain

$$P(A,C) = P(A,B) \cdot P(B,C) \tag{7}$$

Only in this very specialized case does the rule of elimination assume the
form of a transitivity, in which the degrees of probability are simply
multiplied. The case may be illustrated by the example above, with the
qualification that the bowl $\bar{B}$ does not contain any black balls. Other
examples are given in causal chains: for instance, when $A$ means the presence
of a hot day in summer; $B$, the occurrence of a thunderstorm; $C$, a flash of
lightning hitting a house. In the special case where $P(A,B) = 1$ and
$P(B,C) = 1$, the relation (7) determines also $P(A,C) = 1$; here the condition (5)
is no longer required, since the second term in (2) drops out because of
$P(A,\bar{B}) = 0$.¹ These relations are satisfied for logical implications of the
form $(A \supset B)$ and $(B \supset C)$. The relation (3a), too, must hold in this case
because with $(B \supset C)$ we have also $(A.B \supset C)$. This is why the logical
implication follows a general rule of transitivity that is not restricted by any
conditions. It is seen, further, that the transitivity (7), in general, produces
a decrease in the degree of probability. If the intermediary probabilities
written at the right in (7) are smaller than 1, the total probability at the
left in (7) will be smaller than any of the intermediary probabilities. A
corresponding statement cannot be made for the general case (2); here $P(A,C)$
represents a certain mean value between the other probabilities.

A third specialization results by the assumption

$$P(A.B,C) = P(A.\bar{B},C) \tag{8}$$

Introducing this condition into (2) and using the relation
$P(A,B) + P(A,\bar{B}) = 1$, we obtain

$$P(A,C) = P(A.B,C) = P(A.\bar{B},C) \tag{9}$$

Comparison with (4, § 14) shows that this means the independence of $B$
and $C$ with respect to $A$. In the frequency interpretation, (9) means that if
the subsequences selected from the $C$-sequence by $B$ and $\bar{B}$, respectively,
contain $C$ with equal relative frequencies, this frequency is the same as in
the main sequence.

It has been pointed out that $P(A.\bar{B},C)$ is not determined by $P(A,B)$ and
$P(A.B,C)$; but (2) states that a determination results if $P(A,C)$ is added.
This connection is expressed by the solution of (2) for $P(A.\bar{B},C)$:

$$P(A.\bar{B},C) = \frac{P(A,C) - P(A,B) \cdot P(A.B,C)}{1 - P(A,B)} \tag{10}$$

The relation shows how a probability containing a negation in the first term
is calculated from the probabilities of nonnegative reference. We must except
the case $P(A,B) = 1$, since in this case the value of (10) is indeterminate;
this condition is also understood for the relations (11), (12), and (14), to be
derived presently.

¹ If it is known that $P(B,A) > 0$, even the condition (3a) can be omitted, because this
condition then follows from $P(B,C) = 1$. See (6, § 25).
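
Relation (10) can be sketched as a small function that also refuses the excepted case $P(A,B) = 1$. The function name and the numerical values are assumptions chosen for illustration.

```python
from fractions import Fraction


def p_c_given_not_b(p_c, p_b, p_c_given_b):
    """Solve the rule of elimination for P(A.not-B,C), as in relation (10)."""
    if p_b == 1:
        # Denominator 1 - P(A,B) vanishes: the value is indeterminate.
        raise ValueError("P(A,B) = 1 leaves relation (10) indeterminate")
    return (p_c - p_b * p_c_given_b) / (1 - p_b)


# Hypothetical values: P(A,C) = 31/100, P(A,B) = 3/10, P(A.B,C) = 4/5.
result = p_c_given_not_b(Fraction(31, 100), Fraction(3, 10), Fraction(4, 5))
print(result)
```

Whether the returned value is an admissible probability, that is, lies between 0 and 1, is a separate question that the function does not check.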

As before, some important special cases must be considered. We see that
with

$$P(A,C) = P(A.B,C) \tag{11a}$$

we also have

$$P(A.\bar{B},C) = P(A.B,C) \tag{11b}$$

in correspondence to (9); that is, the converse of the relation leading from
(8) to (9) is valid. Furthermore, we infer that, if $P(A.B,C) > P(A,C)$, we
have $P(A.\bar{B},C) < P(A,C)$, and, similarly, if $P(A.B,C) < P(A,C)$, we have
$P(A.\bar{B},C) > P(A,C)$. This result follows because for $P(A.B,C) = P(A,C)$
the relation (10) supplies $P(A.\bar{B},C) = P(A,C)$, and this value is diminished
or increased according as $P(A.B,C)$ is larger or smaller than $P(A,C)$.

For mutually exclusive events $B$ and $C$, that is, $P(A.B,C) = 0$, relation
(10) assumes the simple form

$$P(A.\bar{B},C) = \frac{P(A,C)}{1 - P(A,B)} \tag{12}$$

Another special case arises for

$$P(A,B) = P(A,C) \tag{13}$$

Then (10) is transformed into

$$\frac{P(A.\bar{B},C)}{P(A.B,\bar{C})} = \frac{P(A,B)}{P(A,\bar{B})} = \frac{P(A,C)}{P(A,\bar{C})} \tag{14}$$



From (10) we can derive two important inequalities that restrict the choice
of the probabilities to be given. Since $P(A.\bar{B},C)$ is bound by the
normalization (8, § 13), the expression on the right side of (10) must lie
between 0 and 1 (with inclusion of the limits). This leads to the two
inequalities

$$1 - \frac{1 - P(A,C)}{P(A,B)} \leq P(A.B,C) \leq \frac{P(A,C)}{P(A,B)} \tag{15}$$



The inequality on the left side results from transformation of the condition
that (10) must not be greater than 1; the inequality on the right side arises
from a transformation of the condition that the numerator of (10) must not
be smaller than 0. The double inequality is not necessarily satisfied for given
values $P(A,B)$ and $P(A,C)$, even if $P(A.B,C)$ is chosen according to the
normalization (8, § 13). The relation (15) formulates an additional condition,
which prescribes a narrower domain for $P(A.B,C)$ whenever we have
$1 - P(A,C) < P(A,B)$ or $P(A,C) < P(A,B)$. It can be shown that for independent
events $B$ and $C$, that is, for $P(A.B,C) = P(A,C)$, (15) is always
fulfilled.² It is permissible, therefore, to give two events as independent,
regardless of the values of their probabilities. But if two events are given as
dependent, the degree of dependence must be kept within the limits defined
by (15). The occurrence of such inequalities in regard to the choice of
probabilities may be compared to the occurrence of similar inequalities in
geometry. A triangle, for instance, can be constructed from three given
determinations only when their values satisfy certain numerical restrictions.
Notice that the inequalities (15) hold also for the case $P(A,B) = 1$, which had
to be excepted for (10), since in this case the numerator of (10) must be $= 0$
in order to make possible a finite value of $P(A.\bar{B},C)$, and thus the conditions
leading to (15) are satisfied. For mutually exclusive events $B$ and $C$, that is,
$P(A.B,C) = 0$, (15) leads to the trivial condition $P(A,B) + P(A,C) \leq 1$.
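
The double inequality (15) lends itself to a quick numerical check. The sketch below is an illustration under assumptions: the function names are hypothetical, the bounds are clipped to the interval from 0 to 1 to combine (15) with the normalization, and the values 4/5 and 7/10 are chosen only as an example.

```python
from fractions import Fraction


def dependence_bounds(p_b, p_c):
    """Admissible range for P(A.B,C), given P(A,B) = p_b > 0 and P(A,C) = p_c."""
    lower = 1 - (1 - p_c) / p_b   # left side of (15)
    upper = p_c / p_b             # right side of (15)
    return max(lower, 0), min(upper, 1)


def is_admissible(p_b, p_c, p_c_given_b):
    """True if the proposed P(A.B,C) respects both (15) and the normalization."""
    lower, upper = dependence_bounds(p_b, p_c)
    return lower <= p_c_given_b <= upper


bounds = dependence_bounds(Fraction(4, 5), Fraction(7, 10))
independent_ok = is_admissible(Fraction(4, 5), Fraction(7, 10), Fraction(7, 10))
exclusive_ok = is_admissible(Fraction(4, 5), Fraction(7, 10), 0)
print(bounds, independent_ok, exclusive_ok)
```

With these values independence is admissible, whereas mutual exclusion is not, since the sum of the two probabilities exceeds 1.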

We turn now to an extension of the rule of elimination to disjunctions of
more than two terms. There are special kinds of such many-term disjunctions
$B_1 \lor \ldots \lor B_r$ that play a particularly important role in the calculus of
probabilities: disjunctions that are both complete and exclusive. A disjunction
is called complete if it is true; it then follows that at least one of its terms
is true. A disjunction is called exclusive if not more than one of its terms is
true. These concepts, as applied to probability sequences, are used in an
extended sense: the disjunction must have these properties for all elements of
the sequence. Thus completeness, in this sense, is formulated by the statement

$$(B_1 \lor \ldots \lor B_r) \tag{16}$$

The parentheses express, according to the convention given in §§ 10, 12, the
condition that the disjunction is true for all elements of the sequence; and it
would be more correct to speak of completeness and exclusiveness with respect
to the sequence. The latter qualification is always understood when the terms
“complete” and “exclusive” are used in probability considerations.

The combination of the two conditions of completeness and exclusiveness
is expressed by the following $r$ formulas, which are all-statements:³

$$\begin{aligned}
(B_1 &\equiv \bar{B}_2 . \bar{B}_3 \ldots \bar{B}_r) \\
(B_2 &\equiv \bar{B}_1 . \bar{B}_3 \ldots \bar{B}_r) \\
&\;\;\vdots \\
(B_r &\equiv \bar{B}_1 . \bar{B}_2 \ldots \bar{B}_{r-1})
\end{aligned} \tag{17}$$

The equivalence signs of the relations can be conceived as representing two
mutual implications, according to (7a, § 4). The implication running from
left to right expresses exclusiveness; the implication running from right to
left expresses completeness. It can easily be shown that statement (16) is
derivable from the relations (17).

² This is easily seen for the right-hand inequality. The proof for the left-hand inequality
follows from the relation (5, § 23).

³ The exclusive “or” cannot be used to express these conditions. See ESL, p. 45.

For most of the following considerations it will be sufficient if the
disjunctions are complete and exclusive with respect to $A$, that is, with
respect to the subsequence selected by $A$. The symbolic expression is given by
the formulas

$$\begin{aligned}
(A &\supset [B_1 \equiv \bar{B}_2 . \bar{B}_3 \ldots \bar{B}_r]) \\
(A &\supset [B_2 \equiv \bar{B}_1 . \bar{B}_3 \ldots \bar{B}_r]) \\
&\;\;\vdots \\
(A &\supset [B_r \equiv \bar{B}_1 . \bar{B}_2 \ldots \bar{B}_{r-1}])
\end{aligned} \tag{18}$$

From these formulas the statement of completeness relative to $A$ is derivable:

$$(A \supset B_1 \lor \ldots \lor B_r) \tag{19}$$

The condition (18) can be used to replace the stronger condition (17) in all
cases in which only probabilities containing $A$ in the first term are concerned.
Thus when a die is thrown, the six possible results given by the six faces of
the die constitute a disjunction that is complete and exclusive with respect
to the sequence of events $A$ represented by the throwing of the die. For the
sake of simplicity, the condition (17) will always be used, leaving the reader
to construct similar proofs on the basis of the weaker condition (18).

The introduction of many-term disjunctions in the rule of elimination is
made in the same way as was used for the derivation of (2). Corresponding
to (1), we have the relation

$$([B_1 \lor \ldots \lor B_r].C \equiv C) \tag{20}$$

Applying the inference leading to (2), we derive for many-term disjunctions
the extended rule of elimination:

$$P(A,C) = \sum_{k=1}^{r} P(A,B_k) \cdot P(A.B_k,C) \tag{21}$$

Figure 5 (p. 82) may serve as an illustration. The divergent probabilities
again are bound probabilities, so that

$$\sum_{k=1}^{r} P(A,B_k) = 1 \tag{22}$$

is valid; the convergent probabilities, however, are nonbound.

A schematized example for figure 5 is found in games of chance. Let
$B_1 \ldots B_r$ represent bowls containing black and white balls, each in a
different ratio. Let $C$ be the drawing of a black ball, and $A$ an auxiliary bowl
containing numerous tickets, each carrying one of the numbers $1 \ldots r$. If
there are more than $r$ tickets in the bowl and each number occurs repeatedly,
each number has a determinate probability of being drawn from the bowl. We
draw first from the auxiliary bowl and determine from which of the other
bowls we are to draw next. Repeating the two actions again and again, we
obtain a statistical relation between $A$ and $C$, the frequency of which is
determined by $P(A,C)$ according to (21).
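
The bowl example can be sketched numerically with the extended rule of elimination (21). The ticket probabilities and the proportions of black balls below are invented for illustration, as is the function name.

```python
from fractions import Fraction


def extended_rule_of_elimination(p_b, p_c_given_b):
    """P(A,C) = sum over k of P(A,B_k) * P(A.B_k,C), as in (21)."""
    if len(p_b) != len(p_c_given_b):
        raise ValueError("one convergent probability is needed per term B_k")
    if sum(p_b) != 1:
        # (22): the divergent probabilities are bound and must add up to 1.
        raise ValueError("the probabilities P(A,B_k) must add up to 1")
    return sum(pb * pc for pb, pc in zip(p_b, p_c_given_b))


# Hypothetical setup with r = 3 bowls.
ticket_probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]   # P(A,B_k)
black_ratio = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]    # P(A.B_k,C)

p_black = extended_rule_of_elimination(ticket_probs, black_ratio)
print(p_black)
```

Only the first list is checked for adding up to 1; the second list holds nonbound probabilities, whose sum is unrestricted.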



Fig. 5. Schema for extended rule of elimination, according to (21). 


Another example results by taking for $A$ the throwing of two dice, for $C$
the occurrence of face 1 of the second die, for $B_k$ the occurrence of face $k$ of
the first die. Then (21) means that the probability of obtaining 1 with the
second die can be divided, additively, into the probabilities of the
combinations in which this result is accompanied by one side $k$ of the other die.

Both examples represent special cases of (21), namely, cases of such a
kind that, for the first example, $P(A.B_k,C) = P(B_k,C)$ holds; for the second
example, $P(A.B_k,C) = P(A,C)$. This corresponds to the causal conception
of the problem, according to which, in the first example, $B_k$ is the cause of $C$;
in the second example, $A$ is the cause of $C$. However, this is irrelevant to the
treatment of the problem within probability theory; the lines in figure 5
represent probabilities, but not necessarily causal chains. The statement of
the causal relationships requires specific investigation.

§ 20. The General Theorem of Addition 

We shall now investigate the question how to calculate the probability of a
disjunction if the terms of the disjunction do not mutually exclude one
another, that is, if we are dealing with a nonexclusive disjunction. If, for
example, two coins are thrown, what is the probability of obtaining tails
with either coin, that is, of obtaining at least one coin with tails lying up?
A simple addition would give 1/2 + 1/2 = 1, which obviously is a wrong result.
But the conditions for applying the addition theorem are not satisfied, because
it is possible to obtain tails simultaneously with both coins. In order to
calculate the desired probability we must, therefore, transform the question
into a form suitable for the application of the theorem of addition. Several
such methods may be demonstrated.

We can start from the equivalence

$$(B \lor C \equiv B.C \lor B.\bar{C} \lor \bar{B}.C) \tag{1}$$

which leads to mutually exclusive terms and thus permits us to apply the
theorem of addition:

$$P(A, B \lor C) = P(A, B.C \lor B.\bar{C} \lor \bar{B}.C) = P(A,B.C) + P(A,B.\bar{C}) + P(A,\bar{B}.C) \tag{2}$$

In the example of the two coins, the formula gives $P(A, B \lor C) = 3/4$, because
each of the probabilities of the combinations is equal to 1/2 · 1/2.

In practice, other methods may be used to solve the problem. Occasionally
it is possible, using material thinking (see § 5), to contract certain steps that
are made separately in the calculus. The following method may be used:
(1) $B$ occurs; then it is immaterial whether or not $C$ also occurs. The
probability for this case is $P(A,B)$. (2) $B$ does not occur; then $C$ must occur.
The probability for this case is $P(A,\bar{B}.C)$. Since the cases (1) and (2) are
mutually exclusive, the theorem of addition is applicable, and we obtain

$$P(A, B \lor C) = P(A,B) + P(A,\bar{B}.C) \tag{3}$$

a result that is identical with (2) because of $P(A,B) = P(A, B.C \lor B.\bar{C})$.
This method differs from the former one in that the first two cases of the
disjunction (1) are collected in one case by the help of material thinking.
This thinking can also be formalized: in (5c, § 4) we have a formula that
leads directly to (3).

A third method starts from the equivalence

$$(B \lor C \equiv \overline{\bar{B}.\bar{C}}) \tag{4}$$

which leads with (7, § 13) to the simple result:

$$P(A, B \lor C) = 1 - P(A,\bar{B}.\bar{C}) \tag{5}$$

Here the probability of the opposite case is calculated and then is subtracted
from 1. For the example with the two coins, the probability of obtaining
heads with both coins is equal to 1/2 · 1/2 = 1/4. Because in any other case at
least one event of tails must happen, the desired probability is calculated
to be 1 - 1/4 = 3/4.
We now establish for such probabilities a fourth formula that seems very
convenient for technical reasons. It can be derived directly from the calculus
without the aid of material thinking. Because of

$$P(A,B) = P(A,B.C) + P(A,B.\bar{C}) \qquad P(A,C) = P(A,B.C) + P(A,\bar{B}.C) \tag{6}$$

we can write, together with (2), the three formulas

$$\begin{aligned}
P(A, B \lor C) &= P(A,B.C) + P(A,B.\bar{C}) + P(A,\bar{B}.C) \\
0 &= -P(A,B.C) - P(A,B.\bar{C}) + P(A,B) \\
0 &= -P(A,B.C) + P(A,C) - P(A,\bar{B}.C)
\end{aligned} \tag{7}$$

Adding the three formulas, we obtain

$$P(A, B \lor C) = P(A,B) + P(A,C) - P(A,B.C) \tag{8}$$

This formula is called the general theorem of addition. It is a generalization of
the addition theorem (3, § 13), applying to nonexclusive terms. In case
$P(A,B.C) = 0$ it becomes identical with the special theorem of addition
(3, § 13). In contradistinction to the latter, (8) represents an always-true
formula because it is not contingent upon any conditions to be expressed
in the context. The condition of exclusion, which had to be added verbally
to the $P$-notation (3, § 13) as a logical condition, is formalized mathematically
in (8); it is expressed by the case that a mathematical quantity assumes the
value 0.
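
The four routes to the or-probability, formulas (2), (3), (5), and (8), can be compared on the two-coin example. The sketch below assumes, for illustration, that the four outcomes are equally probable; the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# The four outcomes of throwing two coins, assumed equally probable.
outcomes = list(product("HT", repeat=2))


def prob(event):
    """Share of outcomes for which `event` holds."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def b(o):
    return o[0] == "T"   # B: first coin shows tails


def c(o):
    return o[1] == "T"   # C: second coin shows tails


via_2 = (prob(lambda o: b(o) and c(o))
         + prob(lambda o: b(o) and not c(o))
         + prob(lambda o: not b(o) and c(o)))
via_3 = prob(b) + prob(lambda o: not b(o) and c(o))
via_5 = 1 - prob(lambda o: not b(o) and not c(o))
via_8 = prob(b) + prob(c) - prob(lambda o: b(o) and c(o))
print(via_2, via_3, via_5, via_8)
```

All four expressions agree, whereas the uncritical sum of the two single probabilities would give 1.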

In the frequency interpretation, (8) can easily be made comprehensible. 
In dealing with the nonexclusive cases B and C, the couples B.C will occur 
according to, say, the following schema: 

A A A A A A . . . . 

B B BB BB ... . (9) 

C C C C C C ... . 

Adding the frequencies $B$ and $C$, we shall have counted the couples $B.C$
twice; therefore, to form $P(A, B \lor C)$, the frequency of the couples $B.C$ is
to be subtracted once. This fact is expressed in (8).

It need not be expressed as a condition that the probability of the
disjunction, as given by (8), satisfy the normalization of probabilities; this
follows from the double inequality (15, § 19) previously established. After a
simple transformation by means of the theorem of multiplication, the
inequality on the left side of (15, § 19) leads to

$$P(A,B) + P(A,C) - P(A,B.C) \leq 1 \tag{10}$$

Now the inequality on the right gives the result

$$P(A,B.C) \leq P(A,C) \tag{11}$$

By interchanging $B$ and $C$ we obtain

$$P(A,B.C) \leq P(A,B) \tag{12}$$

Therefore the following inequalities are satisfied:

$$P(A, B \lor C) \geq P(A,B) \qquad P(A, B \lor C) \geq P(A,C) \tag{13}$$

The probability of a disjunction is never smaller and, in general, is even
greater than the probability of its individual terms. Thereby the character
of the disjunction as a logical sum is clearly expressed. Addition of a term
connected by “or” signifies an increase in probability, and only in the limiting
case does the probability remain the same.

Some examples may illustrate the general theorem of addition. The testing
of a mechanical appliance reveals, on the average, 2% rejections because of
material defects and 3% rejections because of defects in assembling the parts.
What is the average rejection on the whole? Here the probabilities are given
statistically, as is usual in practice. But we must not assume as total
rejection 3% + 2% = 5%, since the two sources of defect are not mutually
exclusive. An appliance that is faulty because of material defects may also
show a defect owing to assembling. We know from experience that we are
dealing here with independent probabilities; thus we can apply the special
theorem of multiplication. The probability of both defects occurring
simultaneously is given by the product 3% · 2% = 0.06%. Then (8) provides as
average frequency for the total rejection 3% + 2% - 0.06% = 4.94%.

Another example is a firm that sells its products partly through traveling
salesmen and partly through advertisements. The statistics on customers
reveal that 80% of all products are sold by salesmen and 60% by
advertisements. What is the percentage of customers won by advertisements as
well as by salesmen? Since here $P(A, B \lor C) = 1$ (we assume that all products
are sold only in these two ways), it follows that $P(A,B.C)$ = 80% + 60%
- 100% = 40%, that is, 40% of the customers are won by both means
together.
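
Both examples use formula (8), once solved for the disjunction and once for the conjunction. A minimal sketch with the figures given in the text follows; the function and variable names are assumptions.

```python
from fractions import Fraction


def or_probability(p_b, p_c, p_both):
    """General theorem of addition (8): P(A, B or C)."""
    return p_b + p_c - p_both


def and_probability(p_b, p_c, p_either):
    """Formula (8) solved for P(A, B.C)."""
    return p_b + p_c - p_either


# Appliance example: independent defects, so P(A,B.C) is the product.
material = Fraction(2, 100)
assembly = Fraction(3, 100)
total_rejection = or_probability(material, assembly, material * assembly)

# Sales example: every product is sold in at least one of the two ways.
both_channels = and_probability(Fraction(80, 100), Fraction(60, 100), 1)

print(float(total_rejection), float(both_channels))
```

The independence of the two defects is an empirical assumption stated in the text; without it the product could not be used for the joint probability.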

Formula (8) permits a general calculation of the or-probability, but in
applying it we must be sure that the case considered possesses the logical
structure of the theorem of addition. Mistakes of this kind may be illustrated
by two examples that were given by Richard von Mises¹ with the intention
of showing that the addition must not be carried out uncritically, even for
mutually exclusive events. He wishes to restrict the theorem of addition to
events belonging to the same “collective”, that is, the same sequence. My
formulation of the theorem is somewhat more general, since the theorem is
not restricted to events belonging to the same sequence. Instead, another
condition is used, specifying that the probabilities have the same reference
class, or first term. I shall now show that my formulas are applicable to the
examples given by von Mises, and permit the use of the “or” in a reasonable
sense.

¹ Wahrscheinlichkeit, Statistik und Wahrheit (Berlin and Vienna, 1928), p. 40.

Assume that a tennis player has the probability 0.8 of winning in a
tournament in Berlin; he may have the probability 0.7 of winning in a
tournament played the same day at New York. The events are mutually exclusive;
thus one might infer that the probability of the player winning in the one or
in the other tournament was given by the addition of the probabilities, which
would result in 0.8 + 0.7 = 1.5. This is certainly a nonsensical result.

We are dealing here with a question of interpretation. A problem given in
conversational language is to be translated into the strict language of the
calculus; one cannot expect unambiguous rules to be available for such a
translation. To assume that the special theorem of addition is applicable
would be to interpret the problem in the form

$$P(A,B) = 0.8 \qquad P(A,C) = 0.7 \qquad P(A,B.C) = 0 \tag{14}$$

$A$ representing the general situation before the tournaments; $B$, the victory
in Berlin; $C$, in New York. It is obvious that the numerical values used in
the interpretation violate the inequality (15, § 19), because $P(A,B.C) = 0$
implies $P(A.B,C) = 0$, whereas the expression on the left of the inequality
assumes the value 5/8. This illustrates the fact that the condition of exclusion
represents a high degree of dependence and therefore can be combined only
with suitable numerical values of the other given probabilities. It follows that
(14) is not an admissible interpretation of the problem.

An interpretation that comes closer to what is intended by the formulation
of the problem can be given. We consider the probability 0.8 of winning in
Berlin as referring to the first term $B_1$, “if the player participates in Berlin”;
and the probability 0.7 of winning in New York as referring to the first term
$B_2$, “if the player participates in New York”. If $C$ represents “winning”, we
then can set down

$$P(B_1,C) = 0.8 \qquad P(B_2,C) = 0.7 \tag{15}$$

The two probabilities do not differ by their second term, as do the
expressions (14), but by their first term. It is obvious that the probabilities do
not permit the application of formula (8). The general condition $A$ holding
before the tournaments take place appears as a reference class in the sense of
the theorem of elimination (fig. 5, p. 82), representing the fact that the player
may decide to participate in one or the other of the tournaments; and the
condition of exclusion must then be written

$$P(A, B_1.B_2) = 0 \tag{16}$$

When we wish to derive from these conditions the probability of winning,
that is, $P(A,C)$, the two further probabilities

$$P(A,B_1) \qquad P(A,B_2) \tag{17}$$

must be given. This means that the probability of winning depends on the
probabilities of the player deciding, respectively, to participate in New York
or in Berlin.

In this interpretation the problem is solved, since
$P(A, \overline{B_1 \lor B_2}.C) = 0$, by the equations

$$\begin{aligned}
P(A,C) &= P(A, [B_1 \lor B_2 \lor \overline{B_1 \lor B_2}].C) \\
&= P(A, B_1.C) + P(A, B_2.C) \\
&= P(A,B_1) \cdot P(A.B_1,C) + P(A,B_2) \cdot P(A.B_2,C) \\
&= P(A,B_1) \cdot P(B_1,C) + P(A,B_2) \cdot P(B_2,C)
\end{aligned} \tag{18}$$

because we may assume (10, § 14). That we cannot carry out the calculation
numerically is due to the fact that the probabilities (17) are not given, but
the failure to obtain a solution does not result from an inadmissible use of
the “or”. It is clear, furthermore, that in this interpretation the sum of
$P(B_1,C)$ and $P(B_2,C)$ can be greater than 1, since these values represent
nonbound probabilities (see § 19).
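
Once values for the probabilities (17) are supplied, relation (18) yields a number. The participation probabilities in the following sketch are purely hypothetical, since the problem as stated does not give them; the function name is also an assumption.

```python
from fractions import Fraction

P_WIN_BERLIN = Fraction(4, 5)      # P(B_1,C) = 0.8, from the text
P_WIN_NEW_YORK = Fraction(7, 10)   # P(B_2,C) = 0.7, from the text


def p_win(p_berlin, p_new_york):
    """P(A,C) according to (18), for given P(A,B_1) and P(A,B_2)."""
    if p_berlin + p_new_york > 1:
        # By (16) the two participations exclude each other.
        raise ValueError("P(A,B_1) + P(A,B_2) must not exceed 1")
    return p_berlin * P_WIN_BERLIN + p_new_york * P_WIN_NEW_YORK


# Hypothetical: the player chooses either tournament with probability 1/2.
result = p_win(Fraction(1, 2), Fraction(1, 2))
print(result)
```

The result lies between 0.7 and 0.8 whenever the player certainly takes part in one of the two tournaments, even though the two nonbound probabilities add up to 1.5.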

Von Mises presents another example that is supposed to demonstrate the
use of an unreasonable “or”. Let 0.011 be the probability that a man 40 years
of age will die between his 40th and 41st birthdays; and let the probability
that a man 41 years old marries in that year be 0.009. Both events are
exclusive for one individual. If we now want to find the probability that a man
40 years of age either dies within the current year or marries in the following
year, it may occur to us to add the given numbers, thus obtaining the result,
0.011 + 0.009 = 0.020. Von Mises is right in asserting that this is a
nonsensical result.

For the conception of the or-probability developed in this section, however,
the problem is not meaningless. The probability of a man 40 years old dying
this year or marrying next year can be interpreted to have a definite meaning.
It may be expressed statistically: after a lapse of two years, we count among
all the original quadragenarians those who died within the first year or
married in the second year. These numbers may indeed be added, in agreement
with (8). However, we must not add the numerical values given; the second
value cannot be used because it states, not the probability that a man 40
years of age will marry in his 41st to 42d year, but the probability that a
man 41 years of age will marry in that period. The probabilities are not the
same, because some of the men will have died within the year. The value
0.009, therefore, is to be interpreted as the probability that a man 40 years
old who reaches his 41st year will marry in his 41st to 42d year. This
probability is represented by $P(A.\bar{B},C)$, if A stands for the class of
quadragenarians, B for the class of deaths among them, and C for the class of
men 41 years old who marry. We have, therefore,

$$ P(A,B) = 0.011 \qquad P(A,B.C) = 0 \qquad P(A.\bar{B},C) = 0.009 \tag{19} $$

and obtain 

$$
\begin{aligned}
P(A,B \lor C) &= P(A,B) + P(A,C) \\
              &= P(A,B) + P(A,[B \lor \bar{B}].C) \\
              &= P(A,B) + P(A,B.C) + P(A,\bar{B}.C) \\
              &= P(A,B) + P(A,\bar{B}) \cdot P(A.\bar{B},C) \\
              &= 0.011 + (1 - 0.011) \cdot 0.009 = 0.0199
\end{aligned}
\tag{20}
$$

This represents the probability that a man 40 years of age either will die 
in his 40th to 41st year or will marry in his 41st to 42d year. 
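
The arithmetic of (20) can be checked mechanically. The following is a minimal
sketch and not part of the original text; the function name and parameter names
are illustrative assumptions.

```python
from fractions import Fraction


def prob_die_or_marry(p_die: Fraction, p_marry_given_survive: Fraction) -> Fraction:
    """Evaluate (20): P(A,B v C) = P(A,B) + P(A,not-B) * P(A.not-B,C).

    B (death in the first year) and C (marriage in the second year) are
    exclusive, so the term P(A,B.C) vanishes.
    """
    return p_die + (1 - p_die) * p_marry_given_survive


exact = prob_die_or_marry(Fraction(11, 1000), Fraction(9, 1000))
naive = Fraction(11, 1000) + Fraction(9, 1000)  # the inadmissible plain sum

print(float(exact))  # 0.019901, which rounds to the 0.0199 of (20)
print(float(naive))  # 0.02
```

The difference between the two values is the share of men who die in the first
year and therefore cannot marry in the second.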

In criticizing these examples I do not wish to deny that the probability 
calculus of von Mises supplies equally correct solutions. I intend merely to 
show that we can dispense with the relatively complicated operations of
constructing new collectives, which von Mises has introduced, and that the
desired probabilities can be conceived reasonably as or-probabilities.

We shall now derive from the general theorem of addition some consequences
for later use. We can calculate a probability of the form $P(A,B \supset C)$
by resolving the implication into $\bar{B} \lor C$ according to (6a, § 4) and then
applying the general theorem of addition. We obtain

$$
\begin{aligned}
P(A,B \supset C) &= P(A,\bar{B} \lor C) \\
                 &= P(A,\bar{B}) + P(A,C) - P(A,\bar{B}.C) \\
                 &= P(A,\bar{B}) + P(A,C) - P(A,\bar{B}) \cdot P(A.\bar{B},C)
\end{aligned}
\tag{21}
$$

By the use of (10, § 19) we arrive at

$$ P(A,B \supset C) = 1 - P(A,B) + P(A,B) \cdot P(A.B,C) \tag{22} $$
In a similar way we obtain for the equivalence, by the dissolution
$([B \equiv C] \equiv [B.C \lor \bar{B}.\bar{C}])$, according to (7b, § 4), and with (10, § 19),

$$
\begin{aligned}
P(A,B \equiv C) &= P(A,B.C \lor \bar{B}.\bar{C}) \\
                &= P(A,B.C) + P(A,\bar{B}.\bar{C}) \\
                &= P(A,B) \cdot P(A.B,C) + P(A,\bar{B}) \cdot P(A.\bar{B},\bar{C}) \\
                &= 1 - P(A,B) - P(A,C) + 2P(A,B) \cdot P(A.B,C) \\
                &= 1 + P(A,B.C) - P(A,B \lor C)
\end{aligned}
\tag{23}
$$

A formula containing an exclusive “or” will now be constructed. According
to (1, § 4), this operation can be defined as

$$ b \wedge c =_{Df} (b \lor c).\overline{b.c} \tag{24} $$

Because of the equivalence

$$ (b \lor c).\overline{(b.c)} \equiv (b \lor c).(\bar{b} \lor \bar{c}) \equiv b.\bar{c} \lor \bar{b}.c \tag{25} $$

we can write, using (7b and 7c, § 4),

$$ b \wedge c \equiv \overline{b \equiv c} \tag{26} $$

The symbol of the exclusive “or” can be used also in the class calculus.
The class $B \wedge C$ represents, according to (24), the common class of $B \lor C$
and $\overline{B.C}$, that is, the part of the joint class of B and C that results by
subtracting the common class of B and C. Because of the relation (26) we have

$$ P(A,B \wedge C) = P(A,\overline{B \equiv C}) = 1 - P(A,B \equiv C) \tag{27} $$

With the use of (23) we obtain, applying (8),

$$ P(A,B \wedge C) = P(A,B) + P(A,C) - 2P(A,B.C) \tag{28} $$

Although we have thus derived a formula dissolving an exclusive “or”, the 
result shows that it is not possible, for the special theorem of addition, to 
eliminate the condition of exclusion by the use of a symbol for the exclusive 
“or”. The formula 

$$ P(A,B \wedge C) = P(A,B) + P(A,C) \tag{29} $$

is false if it is conceived as holding for all B and C; it holds only if
P(A,B.C) = 0, that is, if B and C are mutually exclusive. But if this condition must
again be added, the introduction of the symbol of the exclusive “or” is useless. 
The aim of expressing the addition theorem completely in the mathematical 
notation is achieved, instead, in the general theorem of addition formulated 
in (8). 
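
Formulas (22), (23), and (28) can be verified by direct counting on a small
joint distribution. The sketch below is an illustration only: the weights in
`joint` are hypothetical, and the helper `prob` is an assumed name, not
notation from the text.

```python
from fractions import Fraction as F

# Hypothetical joint distribution of (B, C) inside the reference class A.
joint = {
    (True, True): F(1, 10),
    (True, False): F(3, 10),
    (False, True): F(2, 10),
    (False, False): F(4, 10),
}


def prob(event):
    """P(A, event): total weight of the (b, c) cases where event(b, c) holds."""
    return sum(w for (b, c), w in joint.items() if event(b, c))


p_b = prob(lambda b, c: b)
p_c = prob(lambda b, c: c)
p_bc = prob(lambda b, c: b and c)
p_c_given_b = p_bc / p_b  # P(A.B,C)

implication = prob(lambda b, c: (not b) or c)   # P(A, B implies C)
equivalence = prob(lambda b, c: b == c)         # P(A, B equivalent to C)
exclusive_or = prob(lambda b, c: b != c)        # P(A, B exclusive-or C)

assert implication == 1 - p_b + p_b * p_c_given_b            # (22)
assert equivalence == 1 - p_b - p_c + 2 * p_b * p_c_given_b  # (23)
assert exclusive_or == p_b + p_c - 2 * p_bc                  # (28)
assert exclusive_or != p_b + p_c                             # (29) fails here
```

Because B and C overlap in this distribution, (29) does not hold, whereas (28)
does.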
§ 21. The Rule of the Product

So far in this study, the multiplication theorem has been written in the form

$$ P(A,B.C) = P(A,B) \cdot P(A.B,C) \tag{1} $$

Since the left side is symmetrical with respect to B and C, we may write the
corresponding equation, dividing the product in a different way:

$$ P(A,B.C) = P(A,C) \cdot P(A.C,B) \tag{2} $$

This is not a new axiom; it follows from (1) by substituting B for C and
C for B, and in view of the fact that the “and” on the left is commutative.
Because of the equality of the expressions written on the left in (1) and (2),
we have

$$ P(A,B) \cdot P(A.B,C) = P(A,C) \cdot P(A.C,B) \tag{3} $$

This equation is called the rule of the product.

For example, let P(A,B.C) be the probability that a person, A, shows
ability for physics, B, as well as for music, C. 1 Then (1) represents one form
of splitting the probability of the product into two probabilities: the
probability that a person has an ability for physics and the probability that a
person endowed with an ability for physics also shows a talent for music.
Formula (2) represents the opposite splitting of the probability of the
product, namely, into the probability that a person has a talent for music and
the probability that a musically gifted person also shows ability for physics.
The two probabilities having two terms in their reference class are not equal;
rather, according to (3), they have the ratio

$$ \frac{P(A.B,C)}{P(A.C,B)} = \frac{P(A,C)}{P(A,B)} \tag{4} $$

The probability P(A,C) of being musically gifted is, in general, much greater
than the probability P(A,B) of having any ability for physics. Therefore,
according to (4), the probability that a person who is able in physics has a
talent for music must be much greater than the probability that a musician
shows an aptitude for physics. But we know from experience that some
connection exists between ability in physics and in music, such that
P(A.B,C) > P(A,C); musical talent occurs more frequently among persons who are
able physicists than corresponds to the general average. Therefore, because
of (4) it must equally be the case that P(A.C,B) > P(A,B), that is, among
musicians also there must be a higher percentage of people with an ability in
physics than corresponds to the average. The ratio must be the same in both
cases, since (4) may be written

$$ \frac{P(A.C,B)}{P(A,B)} = \frac{P(A.B,C)}{P(A,C)} \tag{5} $$

1 It does not matter that, in this example, according to the convention given, we should
write for A,B,C the same capital letters but with different subscripts.
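
The symmetry expressed in (4) and (5) can be illustrated numerically. The
figures below are hypothetical and chosen only for illustration; the variable
names are assumptions of this sketch.

```python
from fractions import Fraction as F

# Hypothetical figures, chosen only for illustration.
p_physics = F(1, 100)               # P(A,B)
p_music = F(1, 10)                  # P(A,C)
p_music_given_physics = F(3, 10)    # P(A.B,C), above the average P(A,C)

# Rule of the product (3), solved for P(A.C,B).
p_physics_given_music = p_physics * p_music_given_physics / p_music

ratio_physics = p_physics_given_music / p_physics   # left side of (5)
ratio_music = p_music_given_physics / p_music       # right side of (5)

print(p_physics_given_music)       # 3/100
print(ratio_physics, ratio_music)  # 3 3
```

Both conditional probabilities exceed their respective averages by the same
factor, although the probabilities themselves differ by the ratio stated in (4).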

Furthermore, we derive from (5) that if P(A.C,B) = P(A,B), then also
P(A.B,C) = P(A,C). This means that the independence of B and C with
respect to A is a relation symmetrical in B and C (see p. 105). The condition
of exclusion is also symmetrical, because if P(A.B,C) = 0, then P(A.C,B)
= 0, according to (5), provided P(A,B) and P(A,C) are different from 0.
Solving (3) for P(A.C,B), we obtain

$$ P(A.C,B) = P(A.B,C) \cdot \frac{P(A,B)}{P(A,C)} \tag{6} $$

This relation shows that the probability P(A.C,B) is determined by the
three probabilities P(A,B), P(A,C), and P(A.B,C). Since the latter three
probabilities determine also the probability $P(A.\bar{B},C)$, as is shown in
(10, § 19), it follows that all probabilities of B and C relative to A are
determined by these three probabilities. Thus $P(A.\bar{C},B)$ is derivable by means
of (6), when we substitute there $\bar{C}$ for C and use the rule of the complement
(7, § 13).

$$ P(A.\bar{C},B) = [1 - P(A.B,C)] \cdot \frac{P(A,B)}{1 - P(A,C)} \tag{7} $$

The three probabilities

$$ P(A,B) \qquad P(A,C) \qquad P(A.B,C) \tag{7'} $$

will be called the fundamental probabilities of the three events A,B,C; they
determine completely the probability status of B and C with respect to A. The
choice of these values as fundamental must be regarded as a convention;
any other three independent values might be chosen, for instance, the values
P(A,B), P(A,C), P(A,B.C). But the convention will be seen to be expedient.

The numerical values of the fundamental probabilities can be chosen
arbitrarily when a problem is to be given; they are subject only to the restrictions
of the inequalities (15, § 19). It has been shown in § 20 that these inequalities
suffice to guarantee that the probabilities $P(A,B \lor C)$ and P(A,B.C) are
between 0 and 1, limits included. Formula (6) shows that then P(A.C,B)
also is bound to these limits, since we derive from the condition on the right
of (15, § 19) that

$$ P(A.B,C) \cdot \frac{P(A,B)}{P(A,C)} \leq 1 \tag{8} $$

The inequalities (15, § 19) formulate, therefore, the necessary and sufficient
restrictions to which the fundamental probabilities are subject.
The term “restrictions” is applied to an arbitrary choice of numerical
values such as is made when fictitious problems are constructed. For all
statistics that are empirically compiled, these restrictions are satisfied
automatically. With respect to applications, the result can be stated as
follows: when three events are concerned, it is sufficient to ascertain
statistically the values of the three fundamental probabilities; these values,
which will always satisfy the inequalities (15, § 19), are sufficient to derive
all other probabilities of B and C that have the term A as reference class or
as a factor of the reference class.
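
The claim that the three fundamental probabilities (7') determine all others
can be sketched as follows. The function names and the sample values are
assumptions of this illustration. The check in `admissible` is a stand-in for
the inequalities (15, § 19), whose exact form is not reproduced here: it merely
tests that every derived value lies between 0 and 1.

```python
from fractions import Fraction as F


def derive_all(p_b, p_c, p_c_given_b):
    """Derive further probabilities of B and C relative to A.

    Inputs are the fundamental probabilities (7'): P(A,B), P(A,C), P(A.B,C).
    Assumes 0 < P(A,B) < 1 and 0 < P(A,C) < 1 so that every division is defined.
    """
    p_bc = p_b * p_c_given_b                       # multiplication theorem (1)
    return {
        "P(A,B.C)": p_bc,
        "P(A,B v C)": p_b + p_c - p_bc,            # general theorem of addition
        "P(A.C,B)": p_bc / p_c,                    # formula (6)
        "P(A.not-C,B)": (p_b - p_bc) / (1 - p_c),  # formula (7)
        "P(A.not-B,C)": (p_c - p_bc) / (1 - p_b),  # cf. (10, § 19)
    }


def admissible(p_b, p_c, p_c_given_b):
    """True if every derived value lies between 0 and 1, limits included."""
    return all(0 <= v <= 1 for v in derive_all(p_b, p_c, p_c_given_b).values())


table = derive_all(F(2, 5), F(3, 10), F(1, 4))
print(table["P(A.C,B)"])                        # 1/3
print(table["P(A.not-C,B)"])                    # 3/7
print(admissible(F(2, 5), F(3, 10), F(1, 4)))   # True
print(admissible(F(2, 5), F(3, 10), F(9, 10)))  # False: P(A,B.C) would exceed P(A,C)
```

The second call shows a fictitious choice of values that violates the
restrictions: formula (6) would then yield a value above 1.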

If B and C are events that stand to each other in the relation of cause to
effect, (6) becomes of particular interest. For instance, let A represent the
occurrence of a hot day in summer; B, the occurrence of a thunderstorm;
C, the occurrence of a change in the weather. Then P(A.C,B) represents
the probability that a change of weather observed on a hot day has been
preceded by a thunderstorm. In contradistinction to the example referring
to talents in music and physics, in which the probabilities P(A.B,C) and
P(A.C,B) express mere correlations, the probabilities refer here to causal
relations: the thunderstorm is a possible cause of the change in the weather.
The quantity P(A.B,C) is therefore the probability that a certain cause
will produce a particular effect, and the quantity P(A.C,B) is the
probability that an observed effect was produced by a specified cause. With
respect to applications of this kind, (6) is also called the rule for the
probability of a cause. In this interpretation (6) is usually given another
form. Considering $\bar{B}$ as another possible cause of C, and expanding
P(A,C) according to the rule of elimination (2, § 19), we transform (6) into

$$ P(A.C,B) = \frac{P(A,B) \cdot P(A.B,C)}{P(A,B) \cdot P(A.B,C) + P(A,\bar{B}) \cdot P(A.\bar{B},C)} \tag{9} $$


The expression obtains a more general form when the version (21, § 19) of
the rule of elimination is used:

$$ P(A.C,B_k) = \frac{P(A,B_k) \cdot P(A.B_k,C)}{\sum_{i=1}^{r} P(A,B_i) \cdot P(A.B_i,C)} \tag{10} $$

This formula carries the name of the English clergyman Thomas Bayes 2 and
is called the rule of Bayes. The schema of figure 5 (p. 82) may serve again
as illustration.

2 Thomas Bayes’ Essay towards Solving a Problem in the Doctrine of Chances was published
after his death in Philosophical Transactions of the Royal Society of London, Vol. 53 (1763),
p. 370. This paper gives only the simplified version (12) of the formula. The general formula
(10) was introduced by Pierre Simon Laplace, Théorie analytique des probabilités (Paris,
1812), Vol. II, chap. 1 (3d ed.; Paris, 1820), p. 182. The major interest of both authors
concerned the application of the rule to the derivation of convergence formulas of induction, as
presented in § 62.
The quantities $P(A,B_i)$, which occur in Bayes’s rule, have been named
“a priori probabilities”. The term is misleading because of its metaphysical
connotations, and I prefer to call them antecedent probabilities. The name
indicates that in these probabilities the event $B_i$ is referred to certain general
data A the acquisition of which precedes the observation of the specific
data included in C. It goes without saying that antecedent probabilities are
of the same type as all other probabilities.

The probabilities $P(A.C,B_k)$ are called inverse probabilities. Bayes’s rule
determines the inverse probabilities as functions of the forward probabilities,
the latter term including both kinds of probabilities occurring on the right
of (10). It is important to realize that such a determination is possible only
if, among the forward probabilities, the antecedent probabilities are given;
without a knowledge of the latter the problem would be indeterminate. Only
when the antecedent probabilities are all equal, that is, when

$$ P(A,B_1) = P(A,B_2) = \ldots = P(A,B_r) \tag{11} $$

do they disappear in the formula, since then (10) assumes the simplified form

$$ P(A.C,B_k) = \frac{P(A.B_k,C)}{\sum_{i=1}^{r} P(A.B_i,C)} \tag{12} $$


But in order to apply (12) we must have the positive knowledge expressed 
in (11). It is by no means permissible to use (12) when the values of the 
antecedent probabilities are unknown. Absence of knowledge of numerical 
values is not equivalent to knowledge of their equality. The disregard of this 
simple logical fact has become the source of many erroneous interpretations 
of Bayes’s rule. 3 When nothing about the antecedent probabilities is known, 
we must simply admit that the inverse probabilities cannot be determined. 
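
The relation between the general form (10) and the simplified form (12) can be
made concrete in a short sketch. The function names and the sample values are
assumptions of this illustration, not part of the original text.

```python
from fractions import Fraction as F


def bayes(antecedent, forward):
    """Rule of Bayes (10): inverse probabilities P(A.C,B_k).

    antecedent[k] is P(A,B_k); forward[k] is P(A.B_k,C).
    """
    joint = [p * q for p, q in zip(antecedent, forward)]
    total = sum(joint)
    return [j / total for j in joint]


def bayes_equal_antecedents(forward):
    """Simplified form (12); valid only if (11) is known to hold."""
    total = sum(forward)
    return [q / total for q in forward]


forward = [F(1, 2), F(1, 4), F(1, 4)]       # hypothetical P(A.B_k,C)
equal = [F(1, 3)] * 3                       # satisfies (11)
unequal = [F(1, 10), F(3, 10), F(6, 10)]    # violates (11)

print(bayes(equal, forward) == bayes_equal_antecedents(forward))    # True
print(bayes(unequal, forward) == bayes_equal_antecedents(forward))  # False
```

With unequal antecedent probabilities the simplified form gives a different,
and therefore wrong, set of inverse probabilities.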

The following example may serve as a numerical illustration of Bayes’s
rule. A factory A has three machines for the manufacture of a certain
product; machine $B_1$ produces 10,000 pieces daily; machine $B_2$, 20,000 pieces;
machine $B_3$, 30,000 pieces. All three machines occasionally produce faulty
pieces, C; and, specifically, the first machine has on the average a rejection
of 4%; the second, of 2%; the third, of 4%. A characteristic sample is found
among the rejects, and we ask for the probability stating by which of the
three machines it was produced. We have here

$$ P(A,B_1) = \frac{10{,}000}{60{,}000} = \frac{1}{6} \qquad P(A,B_2) = \frac{20{,}000}{60{,}000} = \frac{1}{3} \qquad P(A,B_3) = \frac{30{,}000}{60{,}000} = \frac{1}{2} $$

$$ P(A.B_1,C) = 4\% = \frac{1}{25} \qquad P(A.B_2,C) = 2\% = \frac{1}{50} \qquad P(A.B_3,C) = 4\% = \frac{1}{25} $$

Bayes’s rule (10) then yields

$$ P(A.C,B_1) = \frac{1}{5} \qquad P(A.C,B_2) = \frac{1}{5} \qquad P(A.C,B_3) = \frac{3}{5} $$

3 These misinterpretations go back to Bayes and Laplace, who regarded it permissible to
apply (12) when the antecedent probabilities are unknown; the name a priori probabilities was
used with reference to such an “a priori reasoning”. See the criticism of the principle of
indifference in § 68.
We see clearly the influence of the antecedent probabilities $P(A,B_k)$, which
are calculated in a simple way from the distribution of the total production
over all the machines. Though the second machine works twice as well as
the first, it is equally probable that the rejected piece originates from the
second as from the first machine; this is due to the fact that the second
machine produces twice as many pieces. The third machine, which supplies
half of the total production, is to be assigned the probability 3/5 of having
produced the reject; this probability is greater than 1/2 because one of the
two other machines works more reliably. Therefore, of all rejects, 20%
originate from the first, 20% from the second, and 60% from the third machine;
this represents the statistical meaning of the inverse probabilities calculated.
At the same time we recognize that without such antecedent probabilities
the problem is not determined. Should we consider only the efficiency ratio
of the machines, 4:2:4, and calculate the inverse probabilities as 4/10, 2/10,
4/10, this would mean putting $P(A,B_1) = P(A,B_2) = P(A,B_3)$; but we must check
whether the assumption is justified. The probability of causes cannot be
calculated without a knowledge of the antecedent probabilities. (Further
examples are given in the appendix to chap. 3, pp. 123-127.)
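
The factory example can be recomputed from the production figures and
rejection rates stated above. This is a minimal sketch; the dictionary and
variable names are assumptions of the illustration.

```python
from fractions import Fraction as F

daily_output = {"B1": 10_000, "B2": 20_000, "B3": 30_000}
reject_rate = {"B1": F(4, 100), "B2": F(2, 100), "B3": F(4, 100)}  # P(A.B_k,C)

total_output = sum(daily_output.values())
antecedent = {k: F(n, total_output) for k, n in daily_output.items()}  # P(A,B_k)

# Rule of Bayes (10).
joint = {k: antecedent[k] * reject_rate[k] for k in daily_output}
p_reject = sum(joint.values())                       # P(A,C) by elimination
inverse = {k: joint[k] / p_reject for k in joint}    # P(A.C,B_k)

# What (12) would give if the antecedent probabilities were ignored.
rate_total = sum(reject_rate.values())
ignoring_antecedents = {k: r / rate_total for k, r in reject_rate.items()}

for k in daily_output:
    print(k, antecedent[k], inverse[k], ignoring_antecedents[k])
# B1 1/6 1/5 2/5
# B2 1/3 1/5 1/5
# B3 1/2 3/5 2/5
```

The last column reproduces the values 4/10, 2/10, 4/10 that result when only
the efficiency ratio 4:2:4 is considered.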

The range of application for Bayes’s rule is extremely wide, because nearly 
all inquiries into the causes of observed facts are performed in terms of this 
rule. The method of indirect evidence, as this form of inquiry is called, consists
of inferences that on closer analysis can be shown to follow the structure of 
the rule of Bayes. The physician’s inferences, leading from the observed 
symptoms to the diagnosis of a specified disease, are of this type; so are the 
inferences of the historian determining the historical events that must be 
assumed for the explanation of recorded observations; and, likewise, the 
inferences of the detective concluding criminal actions from inconspicuous 
observable data. In many instances the use of probability relations is not 
manifest because the probabilities occurring have either very high or very 
low values. Thus, when a corpse is found, it is virtually certain that a murder
has been committed; and a fingerprint on the handle of a pistol may be
considered as strict evidence for the assumption that a certain person X has
fired the pistol. That even in such cases the inference has the structure of
Bayes’s rule is often seen from the fact that appraisals of the antecedent
probabilities are made. Thus an inquiry by the detective into the motives
of a crime is an attempt to estimate the antecedent probabilities of the case,
namely, the probability of a certain person committing a crime of this kind,
irrespective of the observed incriminating data. Similarly, the general
inductive inference from observational data to the validity of a given scientific
theory must be regarded as an inference in terms of Bayes’s rule. 4

The theory of indirect evidence has been obscured by the assumption that
there exists an inference leading from an implication $(B \supset C)$ to a probability
implication $(C \underset{p}{\ni} B)$, which would enable us to infer with probability from
an observed effect C the presence of the cause B. This inference has been
called an inference by confirmation. 5 The analysis of the calculus of
probability shows that no such inference exists. The probability of a cause B
can be inferred from the observation of the effect C only if all the probabilities
occurring on the right-hand side of (9) are known. The relation $(B \supset C)$
supplies only P(A.B,C) = 1. There remain to be known, therefore, the
antecedent probability P(A,B) and the probability $P(A.\bar{B},C)$. These values
are in no way restricted by the fact that $(B \supset C)$ holds and must be
independently ascertained for this as well as for the general case P(A.B,C) < 1.

For the case P(A.B,C) = 1, a weaker inference can be made when it is
known, at least, that the other probabilities on the right-hand side of (9)
exist. Putting $P(A,B) = p$, $P(A.\bar{B},C) = v$, this side then assumes the form

$$ \frac{p}{p + (1 - p) \cdot v} \tag{13} $$

If v = 1, the denominator will be = 1; if v < 1, it will be < 1. The fraction,
therefore, will be ≥ p, and we have the inequality

$$ P(A.C,B) \geq P(A,B) \tag{14} $$

If we know that v < 1, we can say that the observation of C will increase the
probability of B. But even the latter statement presupposes more than the
observation of C; besides the knowledge that $P(A.\bar{B},C)$ exists and is < 1, it
presupposes knowledge about the existence of the probability P(A,B). It is
obvious, furthermore, that when inferences of indirect evidence are made,
the conclusion is not restricted to the assertion of a mere increase in
probability. We wish to assert more, namely, to arrive at an estimate of whether
the probability P(A.C,B) is a high value. This aim can be reached only
when the values P(A,B) and $P(A.\bar{B},C)$ are known to a certain degree of
approximation.

4 For a more elaborate discussion of this inference see §§ 84-85.

5 R. Carnap, “Testability and Meaning,” in Philos. of Science, Vol. III (1936), p. 420;
Vol. IV (1937), p. 1. Instead of the relation of implication in $B \supset C$, other relations that
make the inference even worse are sometimes used.
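
How strongly the observation of C raises the probability of B depends on the
value v, as expression (13) shows. The following sketch tabulates (13) for a
fixed, hypothetical antecedent probability; the function name and the chosen
values are assumptions of the illustration.

```python
from fractions import Fraction as F


def inverse_given_implication(p, v):
    """Expression (13): P(A.C,B) when P(A.B,C) = 1.

    p is the antecedent probability P(A,B); v is P(A.not-B,C).
    Assumes the denominator p + (1 - p) * v is not zero.
    """
    return p / (p + (1 - p) * v)


p = F(1, 100)  # hypothetical antecedent probability of the cause B
for v in (F(1), F(1, 2), F(1, 100), F(1, 10000)):
    print(v, inverse_given_implication(p, v))
# 1 1/100
# 1/2 2/101
# 1/100 100/199
# 1/10000 10000/10099
```

In every row the result is at least p, in agreement with (14), but it is a high
value only when v is small compared with p. The implication alone does not
supply that information.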

The so-called inference by confirmation, therefore, represents an incomplete
schematization of the inference actually made in such cases. When it
seems that we sometimes do infer from an observed consequence that an 
assumption is probably true, as in the confirmation of a scientific theory by 
the observational test of its consequences, such a procedure is possible only 
because more is known than is explicitly stated in the inference, in other 
words, because we have estimates of the other necessary probabilities. This 
additional knowledge plays a part in the inferences actually made, as may be 
illustrated by the problems given in the exercises (see the appendix to chap. 3, 
pp. 123-124). 

An inferential schema like the inference by confirmation, which omits this 
knowledge in its premises, must be regarded as an instance of the fallacy of 
incomplete schematization. Like other fallacies, it will sometimes lead to
correct results; that will be the case when the additional premises are true.
But it does not represent a valid inference, because it does not state all the 
premises required for the truth of the conclusion. Such mistaken
interpretations of the method of indirect evidence make it clear that a satisfactory
analysis of the method can be given only when it is construed as an inference 
that follows the rules of the calculus of probability. 


§ 22. The Rule of Reduction 


In connection with the schema of “bifurcation” as shown in figure 5 (p. 82), 
we shall now derive a theorem for a probability containing an “or” in its 
first term, that is, in the reference class. We can obtain such a probability 
by using theorem (1, § 14), which shows a way of bringing a symbol B from 
the second into the first term of a probability expression. It is convenient to 
use, instead, the mathematical notation (3, § 14), which, when we put D 
for C, may be written 


$$ P(A.B,D) = \frac{P(A,B.D)}{P(A,B)} \tag{1} $$


Let B V C be an exclusive disjunction that is incomplete with respect to A. 
If we substitute B V C for B and apply the distributive law and the special 
theorem of addition, we obtain 

$$ P(A.[B \lor C],D) = \frac{P(A,B.D) + P(A,C.D)}{P(A,B) + P(A,C)} \tag{2} $$
Solving the terms in the numerator by the theorem of multiplication, we
arrive at the formula

$$ P(A.[B \lor C],D) = \frac{P(A,B) \cdot P(A.B,D) + P(A,C) \cdot P(A.C,D)}{P(A,B) + P(A,C)} \tag{3} $$

This theorem, which is valid only when B and C are mutually exclusive,
will be called the special rule of reduction. It solves a probability with a
disjunction in the first place of the probability functor in terms of individual
probabilities, that is, probabilities that do not contain a disjunction in the
first place, but may contain a conjunction in that place. The theorem can
easily be extended for exclusive disjunctions of the form $B_1 \lor \ldots \lor B_m$ that
are incomplete relative to A:

$$ P(A.[B_1 \lor \ldots \lor B_m],D) = \frac{\sum_{k=1}^{m} P(A,B_k) \cdot P(A.B_k,D)}{\sum_{k=1}^{m} P(A,B_k)} \tag{4} $$

The name rule of reduction is chosen in order to express the fact that the
reference class on the left in (4) can be conceived as resulting from the general
reference class A by a reduction. This is to be understood as follows. The
reference class A is the same as the class $A.[B_1 \lor \ldots \lor B_r]$ when the latter
disjunction is complete relative to A. The reference class $A.[B_1 \lor \ldots \lor B_m]$,
containing a disjunction that is incomplete with respect to A, results from the
former by the canceling of some of the $B_i$, a process that may be called a
reduction. Such a reduction will be used when additional knowledge permits
us to drop some of the $B_i$.

An example chosen from political elections may serve as an illustration.
Let $B_1 \ldots B_r$ represent candidates of several political parties for a high
office, say the presidency of a nation; let $P(A,B_k)$ be the probability with
which the election of the candidate $B_k$ may be expected in the situation
A existing before the votes are cast; and let $B_1 \lor \ldots \lor B_r$ be a complete
disjunction relative to A. This disjunction is also exclusive when the political
office can be occupied by only one candidate. Let D be a certain action of
economic importance; it may be expected with a probability $P(A.B_k,D)$
that the candidate $B_k$ carries out this action successfully. For example, D may
be the conclusion of a commercial pact with another country. [In this case,
$P(A.B_k,D)$ would be equal to $P(B_k,D)$, which is, however, irrelevant to
the example.] Before the beginning of the elections the probability P(A,D)
of the signing of the commercial treaty is calculated according to the rule
of elimination (21, § 19). Now assume that the elections are under way, and
that it is already known that certain candidates are not elected; so only a
part $B_1 \lor \ldots \lor B_m$ (m < r) of the candidates remain to be considered. The
probability with which the signing of the commercial pact is to be expected
is then obtained by a reduction of the reference class and is determined by (4).
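
The election example can be sketched numerically with the special rule of
reduction (4). All figures and names below are hypothetical and serve only as
an illustration.

```python
from fractions import Fraction as F


def reduce_reference_class(antecedent, conditional, remaining):
    """Special rule of reduction (4) for mutually exclusive B_k.

    antecedent[k] is P(A,B_k); conditional[k] is P(A.B_k,D);
    remaining lists the indices k still in the disjunction.
    """
    numerator = sum(antecedent[k] * conditional[k] for k in remaining)
    denominator = sum(antecedent[k] for k in remaining)
    return numerator / denominator


# Hypothetical figures for four candidates.
p_elected = [F(4, 10), F(3, 10), F(2, 10), F(1, 10)]    # P(A,B_k)
p_pact = [F(9, 10), F(1, 2), F(1, 10), F(1, 5)]         # P(A.B_k,D)

before = reduce_reference_class(p_elected, p_pact, range(4))  # equals P(A,D)
after = reduce_reference_class(p_elected, p_pact, [0, 1])     # last two are out

print(before)  # 11/20
print(after)   # 51/70
```

With the complete disjunction the denominator is 1 and the result is the value
given by the rule of elimination; after two candidates are dropped, the
remaining terms are renormalized.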

Equation (4) expresses a characteristic asymmetry between the first and
the second terms of a probability implication. An “or” in the second term
leads to an addition of probabilities, but an “or” in the first term, as is
recognized from (4), leads to an addition combined with a division, that is, to the
formation of a mean value. This becomes obvious through consideration of
the special case in which

$$ P(A.B_1,D) = P(A.B_2,D) = \ldots = P(A.B_m,D) \tag{5} $$

Then we obtain from (4)

$$ P(A.[B_1 \lor \ldots \lor B_m],D) = P(A.B_1,D) = \ldots = P(A.B_m,D) \tag{6} $$

Here the addition of or-terms in the first term does not change the probability.
If we assign to each of the candidates who are not yet eliminated an equal
probability that he will successfully carry out the signing of the pact, then it
is immaterial which candidate is elected. The probability of the signing of the
treaty does not depend on the further outcome of the election. Furthermore,
it is of no importance for the relation (6) whether all the $P(A,B_k)$ are equal;
the values of the $P(A.B_n,D)$ $(m < n \leq r)$ no longer matter, because the
respective candidates are already eliminated from the election.

If the disjunction is complete relative to A, (4) represents a second form 
of the rule of elimination (21, § 19), since the denominator becomes equal to 1: 

$$ P(A,D) = P(A.[B_1 \lor \ldots \lor B_r],D) = \sum_{k=1}^{r} P(A,B_k) \cdot P(A.B_k,D) \tag{7} $$

When we write the disjunction in the form $B \lor \bar{B}$, we arrive at the equation

$$ P(A,D) = P(A.[B \lor \bar{B}],D) = P(A,B) \cdot P(A.B,D) + P(A,\bar{B}) \cdot P(A.\bar{B},D) \tag{8} $$

From (7) and (8) we see that a B occurring in the first term can be eliminated 
like a B in the second term, whereas the right side of the equation assumes 
the same form as in (2, § 19) and (21, § 19). 

The difference between the rule of reduction and the rule of elimination,
however, is made clear when we consider disjunctions which, though
exclusive, are incomplete with respect to A. The formula

$$ P(A,[B_1 \lor \ldots \lor B_m].D) = \sum_{k=1}^{m} P(A,B_k) \cdot P(A.B_k,D) \tag{9} $$

which corresponds to the rule of elimination, is then always true, although
this probability is not equal to P(A,D). The corresponding formula with the
disjunction in the first place is given by (4); here the sum in the denominator
is added.

When we add the assumption (5) to (7) and extend (5) to hold for all terms
of a disjunction $B_1 \lor \ldots \lor B_r$, which is complete relative to A, we obtain
from (7), analogous to (6),

$$ P(A,D) = P(A.[B_1 \lor \ldots \lor B_r],D) = P(A.B_1,D) = \ldots = P(A.B_r,D) \tag{10} $$

This represents the trivial assertion that, in this case, the probability from
A to D is the same as the probability from A together with any $B_i$ to D.

Formula (4) will now be presented in a different form in order to make
its structure clearer. Taking into account the condition of exclusion, namely,

$$ P(A.B_k,B_k) = 1 \qquad P(A.B_k,B_i) = 0 \quad \text{for } k \neq i \tag{11} $$

and substituting $B_k$ for D, we obtain from (4)

$$ P(A.[B_1 \lor \ldots \lor B_m],B_k) = \frac{P(A,B_k)}{\sum_{i=1}^{m} P(A,B_i)} \tag{12} $$

These expressions may be called reduced probabilities; they represent the
probability that $B_k$ has with respect to A in combination with the terms of
the incomplete disjunction. In the example given, they represent the
probability of a candidate $B_k$ being elected when we know that only the candidates
$B_1 \ldots B_m$ remain. The reduced probabilities are bound probabilities because,
according to (12),

$$ \sum_{k=1}^{m} P(A.[B_1 \lor \ldots \lor B_m],B_k) = P(A.[B_1 \lor \ldots \lor B_m],[B_1 \lor \ldots \lor B_m]) = 1 \tag{13} $$

This also follows from axiom II,1; when the term in the square brackets in
(13) is denoted by B, the second expression assumes the form P(A.B,B),
and this probability is = 1 because of $(A.B \supset B)$.

Using (12), we can write (4) in the form

$$ P(A.[B_1 \lor \ldots \lor B_m],D) = \sum_{k=1}^{m} P(A.[B_1 \lor \ldots \lor B_m],B_k) \cdot P(A.B_k,D) \tag{14} $$

The desired probability containing the “or” in the first term is here
determined by a summation of terms, each of which contains a probability
$P(A.B_k,D)$ multiplied by the corresponding reduced probability. For the
calculation of the probabilities having an “or” in the first term, the
probabilities $P(A.B_k,D)$ do not suffice, a peculiarity reminiscent of Bayes’s
rule. They must first be multiplied by the reduced probabilities
$P(A.[B_1 \lor \ldots \lor B_m],B_k)$, which, in turn, are determined by the values
$P(A,B_k)$ according to (12). Without these divergent or antecedent
probabilities the problem remains indeterminate.
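
The reduced probabilities (12) and the mean-value form (14) can be sketched as
follows. The figures are hypothetical, and the function name is an assumption
of the illustration.

```python
from fractions import Fraction as F


def reduced_probabilities(antecedent):
    """Formula (12): P(A.[B_1 v ... v B_m],B_k) for the remaining B_k."""
    total = sum(antecedent)
    return [p / total for p in antecedent]


remaining = [F(4, 10), F(3, 10)]       # hypothetical P(A,B_k), k = 1, 2
conditional = [F(9, 10), F(1, 2)]      # hypothetical P(A.B_k,D)

weights = reduced_probabilities(remaining)
mean = sum(w * q for w, q in zip(weights, conditional))  # formula (14)

print(weights)       # [Fraction(4, 7), Fraction(3, 7)]
print(sum(weights))  # 1, as (13) requires
print(mean)          # 51/70
```

The result is a weighted mean of the individual probabilities, with the reduced
probabilities serving as weights.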

We turn now to the extension of these results to nonexclusive disjunctions.
As before, we start from (1). However, when we substitute here for B the
nonexclusive disjunction $B \lor C$, we must use the general theorem of
addition and thus obtain, instead of (2), the formula

$$ P(A.[B \lor C],D) = \frac{P(A,B.D) + P(A,C.D) - P(A,B.C.D)}{P(A,B) + P(A,C) - P(A,B.C)} \tag{15} $$

Applying the general theorem of multiplication, we arrive at the relation

$$ P(A.[B \lor C],D) = \frac{P(A,B) \cdot P(A.B,D) + P(A,C) \cdot P(A.C,D) - P(A,B) \cdot P(A.B,C) \cdot P(A.B.C,D)}{P(A,B) + P(A,C) - P(A,B) \cdot P(A.B,C)} \tag{16} $$

This formula will be called the general rule of reduction. It contains the
special rule, expressed in (3), as the special case resulting for P(A.B,C) = 0.

We can use formula (16) to determine generalized reduced probabilities,
corresponding to (12); for this purpose we substitute B for D. Taking account
of the fact that P(A.B,B) = 1 and using (3, § 21), we obtain

$$ P(A.[B \lor C],B) = \frac{P(A,B)}{P(A,B) + P(A,C) - P(A,B) \cdot P(A.B,C)} \tag{17} $$

$$ P(A.[B \lor C],C) = \frac{P(A,C)}{P(A,B) + P(A,C) - P(A,C) \cdot P(A.C,B)} \tag{18} $$

These formulas determine the value of the reduced probabilities for
nonexclusive disjunctions. The denominators of (17) and (18) are equal because of
(3, § 21).

Introducing the reduced probabilities into (16), we can give to the general 
rule of reduction a form corresponding to (14): 


$$
\begin{aligned}
P(A.[B \lor C],D) = {} & P(A.[B \lor C],B) \cdot P(A.B,D) \\
                       & + P(A.[B \lor C],C) \cdot P(A.C,D) \\
                       & - P(A.[B \lor C],B) \cdot P(A.B,C) \cdot P(A.B.C,D)
\end{aligned}
\tag{19}
$$

For exclusive disjunctions the last term disappears because P(A.B,C ) = 0, 
and the formula is thus transformed into (14) written for m = 2. 
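
The general rule of reduction (16) can be checked against direct counting on a
joint distribution in which B and C overlap. The weights below are
hypothetical, and the helper `prob` is an assumed name used only in this
sketch.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical joint weights of (B, C, D) inside A; B and C overlap.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
cases = list(product([True, False], repeat=3))
joint = {case: F(w, sum(weights)) for case, w in zip(cases, weights)}


def prob(event, given=lambda b, c, d: True):
    """P(A.given, event) by direct counting over the joint weights."""
    base = sum(w for k, w in joint.items() if given(*k))
    hit = sum(w for k, w in joint.items() if given(*k) and event(*k))
    return hit / base


is_b = lambda b, c, d: b
is_c = lambda b, c, d: c
is_d = lambda b, c, d: d
b_and_c = lambda b, c, d: b and c
b_or_c = lambda b, c, d: b or c

direct = prob(is_d, given=b_or_c)

numerator = (prob(is_b) * prob(is_d, is_b) + prob(is_c) * prob(is_d, is_c)
             - prob(is_b) * prob(is_c, is_b) * prob(is_d, b_and_c))
denominator = prob(is_b) + prob(is_c) - prob(is_b) * prob(is_c, is_b)
by_rule_16 = numerator / denominator

print(direct, by_rule_16)  # 3/7 3/7
```

Omitting the subtracted terms, that is, applying the special rule (3) to this
nonexclusive disjunction, would count the overlap of B and C twice.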




It is possible to extend the general rule of reduction to disjunctions of 
more than two events. The resulting formula, however, is cumbersome because 
of the complicated form that the general theorem of addition assumes for 
more than two events. Therefore it is not presented here. 

Consider a schema that represents a combination of the rule of reduction
with the rule of Bayes. Assume the observation of an event D that can be
explained by several possible causes $B_1 \ldots B_r$. We do not know which of
the causes exists, but we know their antecedent probabilities $P(A,B_k)$ relative
to a common first term A and, furthermore, the probabilities $P(A.B_k,D)$
for the production of D by the individual causes $B_k$. The disjunction
$B_1 \lor \ldots \lor B_r$ may be complete and exclusive. We ask for the probability
P(A.D,E) of an event E resulting from D. What are known, however, are only the
individual probabilities $P(A.B_k.D,E)$, which confer upon E a probability
relative to D and to the cause $B_k$ that produced D. The schema is illustrated
by figure 6.

Fig. 6. Schema for rule of composition, according to (21).

For example, assume that D means a symptom of disease that may be explained by several possible causes, and let E mean the case of death. We know the probability of a lethal issue for each of the diseases $B_k$, and we ask for the probability of the death of the patient who shows the symptom D. The probability can be constructed as a mean in terms of the rule of reduction, after the probabilities of the individual causes $B_k$ have been computed through the rule of Bayes. For this purpose, in turn, we must know the antecedent probabilities $P(A,B_k)$, in which A means the class of persons of a certain age and state of health; moreover, we must know the probabilities $P(A.B_k,D)$ for the production of the symptom D by the individual diseases $B_k$.

We have

$$P(A.D, E) = P(A.[B_1 \lor \ldots \lor B_r].D, E) = \sum_{k=1}^{r} P(A.D, B_k) \cdot P(A.D.B_k, E) \qquad (20)$$

according to (4), when we put in (4) A.D for A and E for D, the denominator being equal to 1 because of the completeness of the disjunction. Putting for $P(A.D,B_k)$ its value resulting from the rule of Bayes (10, § 21), we obtain

$$P(A.D, E) = \frac{\sum_{k=1}^{r} P(A,B_k) \cdot P(A.B_k,D) \cdot P(A.B_k.D,E)}{\sum_{k=1}^{r} P(A,B_k) \cdot P(A.B_k,D)} \qquad (21)$$

This formula may be called the rule of composition. It shows how the integral 
probability from D to E is composed of the individual probabilities that 
depend on the cause B k . 
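
The medical example can be sketched in a few lines of Python. The three diseases and all numerical values below are hypothetical and serve only to show how (21) combines the rule of Bayes with the rule of reduction; the function name `composition` is likewise an assumption of the example.

```python
from fractions import Fraction as F

def composition(prior, likelihood, outcome):
    """Rule of composition (21).

    prior[k]      = P(A, B_k)      antecedent probability of cause B_k
    likelihood[k] = P(A.B_k, D)    probability that B_k produces the symptom D
    outcome[k]    = P(A.B_k.D, E)  probability of E given D and the cause B_k
    """
    joint = [p * l for p, l in zip(prior, likelihood)]  # P(A, B_k.D)
    evidence = sum(joint)                               # P(A, D), denominator of (21)
    posterior = [j / evidence for j in joint]           # rule of Bayes: P(A.D, B_k)
    return sum(w * o for w, o in zip(posterior, outcome))

# Hypothetical values for three exclusive and exhaustive diseases.
prior = [F(7, 10), F(2, 10), F(1, 10)]
likelihood = [F(1, 10), F(1, 2), F(9, 10)]
outcome = [F(1, 100), F(1, 10), F(1, 2)]

p_death = composition(prior, likelihood, outcome)
print(p_death, float(p_death))
```

The intermediate list `posterior` contains the probabilities P(A.D,B_k) that enter (20); the final sum is the mean of the individual probabilities weighted by them.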

In the interpretation given, the rule of composition represents an inference from the present (D) by way of the past ($B_k$) to the future (E). Such inferences occur in many kinds of scientific prognoses. As explained for the rule of Bayes, however, the temporal interpretation is not the only possible interpretation; formula (21) and figure 6 represent a logical structure capable of many interpretations.

Note that the rule of composition (21) becomes identical with the rule of 
elimination (21, § 19) when A is identical with D. 


§ 23. The Relation of Independence 

The independence of two events B and C was defined by the condition

$$P(A.B, C) = P(A,C) \qquad (1)$$

We then derived for the theorem of multiplication the special form

$$P(A, B.C) = P(A,B) \cdot P(A,C) \qquad (2)$$


It is also possible to consider (2) as the definition of independence and then to derive (1). This method has the disadvantage that it breaks down if P(A,B) = 0, since then the fraction P(A,B.C)/P(A,B) assumes the indeterminate form 0/0 and thus does not determine the value of P(A.B,C). In this case, (1) is not derivable from (2), whereas (2) is always derivable from (1), even for the case P(A,B) = 0. This is why it is preferable to define independence by (1).

The general theorem of addition assumes for independent events the particular form

$$P(A, B \lor C) = P(A,B) + P(A,C) - P(A,B) \cdot P(A,C) \qquad (3)$$


If the values P(A,B) and P(A,C) are small numbers, their product is of a lower order of magnitude; for such events the product term in (3) can be omitted, which means that in practice it is permissible, for low probabilities, to replace the general rule of addition by the special one. Thus, if P(A,B) = P(A,C) = 1/1000, their product is 1/1,000,000; this value can be neglected in (3), and we have, with sufficient approximation, P(A,B ∨ C) = 1/500.

Although the general inequalities (10 and 13, § 20) show that the probability P(A,B ∨ C) cannot be greater than 1 or smaller than 0, a simple proof may be added to show that this condition is always satisfied for the form (3).¹ Let us put

$$P(A,B) = p \qquad P(A,C) = q \qquad (4)$$

Then the condition under consideration requires that

$$0 \le p + q - pq \le 1 \qquad (5)$$

Now this inequality holds for all numbers p and q between 0 and 1, limits included. To show this, we put

$$p = 1 - p' \qquad q = 1 - q' \qquad (6)$$

Inserting these expressions in (5), we arrive at

$$0 \le 1 - p'q' \le 1 \qquad (7)$$

This is indeed true if p' and q' are between 0 and 1, limits included. The case of equality with 0 in (7) can occur only when both p' and q' are equal to 1, that is, when both p = 0 and q = 0; and equality with 1 will occur only when p' = 0 or q' = 0, that is, when at least one of the two values p or q is equal to 1. In all other cases the expression considered in (5) will differ from its lower and upper limit.
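
The step from (5) to (7) may be written out as follows; nothing beyond the substitution (6) is assumed.

$$p + q - pq = (1 - p') + (1 - q') - (1 - p')(1 - q') = 2 - p' - q' - (1 - p' - q' + p'q') = 1 - p'q'$$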

It is important to realize that the independence defined by (1) is a three-place relation, that is, a relation involving the terms B, C, and A. We must say, B is independent of C with respect to A. Without stating the reference term A we cannot speak of independence. This is contrary to linguistic usage, in which the reference term usually is not expressed.

It may be asked whether this usage can, perhaps, be justified by saying that certain events B and C are independent relative to all A as reference classes, so that the three-place relation can be transformed into a two-place relation by generalization in A. However, it turns out that this assumption is erroneous, for there always exist events A in respect to which any two events B and C are mutually dependent.

¹ See also the remarks following (15, § 19).

This can be proved by choosing as reference class the class A.[B ∨ C]. We can then show that, relative to this reference class, B and C are not independent. Using formula (18, § 22) and applying (1), we obtain the following expression for a reduced probability, holding for any events B and C independent of each other with respect to A:

$$P(A.[B \lor C], C) = \frac{P(A,C)}{P(A,B) + P(A,C) - P(A,B) \cdot P(A,C)} \qquad (8)$$


Applying the inequality (5) to the denominator, we see that, apart from the extreme cases P(A,B) = 1 or P(A,C) = 1, the expression (8) is > P(A,C). Using (4e, § 4) and (1), we have

$$P(A.[B \lor C].B, C) = P(A.B, C) = P(A,C) \qquad (9)$$

Therefore we have

$$P(A.[B \lor C].B, C) < P(A.[B \lor C], C) \qquad (10)$$

Thus with respect to the reference class A. [B V C] the two events B and C 
are not independent. 

This result may be illustrated by an example concerning bets on two 
horses in different races. Let B be the case that the first horse wins; C, that 
the second horse wins. A is given by the general conditions before the races. 
Assume P(A,B) = 50%, P(A,C) = 80%. Relative to the general conditions
A, the two results are independent and thus (1) is satisfied; if the first horse
wins, the chances for the other are not changed. Assume that the races are
over and we are told that one of the horses has won, but not which horse it is. 
The probability that the second horse has won, relative to what we know 
now, is given by (8); this formula furnishes the value 89%. So we now have 
a greater chance than before that the second horse has won. At this moment 
we learn that it was the first horse that won; the result as to the second horse 
is still unknown. Now the probability that the second horse has won is given 
by (9) and is the same as in the beginning, namely, 80%. This shows that 
relative to the situation A .[B V C] the case that the second horse has won 
is not independent of whether the first horse has won. In this situation, 
additional knowledge as to the winning of the first horse will change the 
probability with which we may expect the second horse to have won. 
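
The figures of the horse example can be reproduced by a short Python sketch. The joint weights are built from the values 50% and 80% under the independence condition (1); the helper name `prob` is an assumption of the example.

```python
from fractions import Fraction as F
from itertools import product

p_b, p_c = F(1, 2), F(4, 5)   # P(A,B) = 50%, P(A,C) = 80%

# Weights of the four outcomes (B, C), built on the independence condition (1).
weights = {
    (b, c): (p_b if b else 1 - p_b) * (p_c if c else 1 - p_c)
    for b, c in product([True, False], repeat=2)
}

def prob(ref, attr):
    """P(ref, attr): share of the weight of `ref` that also satisfies `attr`."""
    total = sum(w for case, w in weights.items() if ref(*case))
    hit = sum(w for case, w in weights.items() if ref(*case) and attr(*case))
    return hit / total

second_wins = lambda b, c: c
before_races = prob(lambda b, c: True, second_wins)             # P(A,C)
one_has_won = prob(lambda b, c: b or c, second_wins)            # P(A.[B v C],C), formula (8)
first_has_won = prob(lambda b, c: (b or c) and b, second_wins)  # P(A.[B v C].B,C), formula (9)
print(before_races, one_has_won, first_has_won)
```

The middle value is 8/9, or about 89%, while the first and the last value coincide at 80%; relative to A.[B V C] the news about the first horse therefore lowers the probability for the second.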

This example shows that we must regard independence as a three-place relation, which is comparable, for instance, to the geometrical relation “between”: the statement, “A lies between B and C”, can be formulated only for three terms. Just as the between-relation is symmetrical with respect to the terms B and C, so is the independence relation symmetrical in B and C. For it has been shown in (5, § 21) that if (1) is valid, it is likewise true that

$$P(A.C, B) = P(A,B) \qquad (11)$$

Furthermore, the independence relation is similar to the between-relation in that it is not transitive with respect to B and C. If B and C are mutually independent relative to A, and if C and D are mutually independent with respect to A, then B and D need not be mutually independent relative to A. In the case of the between-relation there even exists intransitivity for B and C, that is, if A is between B and C, and A between C and D, then A never lies between B and D. The independence relation is only nontransitive, that is to say, in the case considered B and D may be mutually independent with respect to A, but such is not necessarily the case. An instance of the nontransitive case is obtained when two dice are linked with a piece of string and, besides, a third, free die is used. The first two dice produce mutually dependent sequences, each of which, however, is independent of the third sequence.

A further property of the independence relation must now be presented. Let three events B, C, D be given, any pair of which is mutually independent with respect to A; then it does not follow that one of the events is independent of the other two with respect to A. We must understand this statement in the following way. From the relations

$$\begin{aligned}
P(A.B, C) &= P(A.D, C) = P(A,C) \\
P(A.C, D) &= P(A.B, D) = P(A,D) \\
P(A.D, B) &= P(A.C, B) = P(A,B)
\end{aligned} \qquad (12)$$

it does not follow that the relations

$$\begin{aligned}
P(A.B.C, D) &= P(A,D) \\
P(A.C.D, B) &= P(A,B) \\
P(A.D.B, C) &= P(A,C)
\end{aligned} \qquad (13)$$

also hold. This fact is expressed by saying that the independence relation is not combinable.

This is shown by the following considerations. If we add to the probabilities on the left side of (13) those obtained by negating B, C, D in the reference class, that is, if we regard probabilities of the kind $P(A.\bar{B}.C, D)$ or $P(A.B.\bar{C}, D)$ or $P(A.\bar{B}.\bar{C}, D)$, there are twelve probabilities that have a triple reference class. For these, according to the rule of elimination, there are only six independent equations of the form

$$P(A,D) = P(A.B, D) = P(A.B, C) \cdot P(A.B.C, D) + [1 - P(A.B, C)] \cdot P(A.B.\bar{C}, D) \qquad (14)$$

The probabilities having a triple reference class are, therefore, not determined by the probabilities having a single or a double reference class, and thus (13) does not follow from (12).

An example of a case for which (12) is valid, but not (13), is provided by 
the sequences 


A AAA AAA A . 
BBBBBBBB . 
CCCCCCCC. 
DDDDDDDD . 


(15) 


for which the first part written down is to be repeated periodically in the same order. Here all the probabilities (12) are equal to 1/2. But P(A.B.C,D) is equal to 1; so is $P(A.\bar{B}.\bar{C}, D)$, and so on.

Sequences for which, apart from the relations (12), the relations (13) are fulfilled, are called completely independent. This terminology applies similarly for a greater number of sequences.
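
That the relations (12) can hold without the relations (13) may be verified on a small periodic sequence. The period of four elements used in the following Python sketch is a hypothetical construction chosen for the illustration and is not claimed to reproduce the sequences (15); every element is taken to belong to A and is marked by the truth values of B, C, D.

```python
from fractions import Fraction as F

# One period of a hypothetical sequence; every element lies in A and is marked (B, C, D).
period = [
    (True, True, True),
    (True, False, False),
    (False, True, False),
    (False, False, True),
]

def prob(ref, attr):
    """Relative frequency of `attr` among the elements of the period satisfying `ref`."""
    selected = [x for x in period if ref(*x)]
    return F(sum(1 for x in selected if attr(*x)), len(selected))

is_b = lambda b, c, d: b
is_c = lambda b, c, d: c
is_d = lambda b, c, d: d
anything = lambda b, c, d: True

# Relations (12): every pair is independent with respect to A.
pairwise = all(
    prob(ref, attr) == prob(anything, attr)
    for ref in (is_b, is_c, is_d)
    for attr in (is_b, is_c, is_d)
    if ref is not attr
)
# Relations (13) fail: B and C together determine D.
p_d = prob(anything, is_d)                              # P(A,D)
p_bc_d = prob(lambda b, c, d: b and c, is_d)            # P(A.B.C,D)
p_nbnc_d = prob(lambda b, c, d: not b and not c, is_d)  # P(A.~B.~C,D)
print(pairwise, p_d, p_bc_d, p_nbnc_d)
```

In this construction D occurs exactly when B and C agree, so that each pair of events is independent while the three events together are not.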


§ 24. Complete Probability Systems 

In § 16 the assumption of a compact sequence A was introduced and shown to be convenient for the frequency interpretation, because it leads to the simple formula (4, § 16). It is possible to introduce this assumption by a logical device that makes its truth analytic: by replacing the class A by the universal class $A \lor \bar{A}$. The condition $x_i \in A \lor \bar{A}$ is then tautologically satisfied for every element $x_i$.

To simplify the notation we introduce the rule that the universal class may be omitted in the first term of a probability expression. This rule is expressed by the definition

$$P(B) =_{Df} P(A \lor \bar{A}, B) \qquad (1)$$


The probability P(B) may be called an absolute probability, in contradistinction to the relative probabilities so far considered. An absolute probability can be regarded as a relative probability the reference class of which is the universal class.

If the statement $x_i \in A$ is true for all $x_i$, though not analytic, the class A, for this sequence, is equivalent to the universal class $A \lor \bar{A}$. If a sequence is compact in A, the indication of the class A may therefore be omitted, and the probabilities may be treated as absolute probabilities.

The axioms and theorems of the calculus are transferred by the definition (1) to absolute probabilities. We find, for instance,

$$P(B) + P(\bar{B}) = 1 \qquad (2)$$

$$P(B \lor C) = P(B) + P(C) - P(B.C) \qquad (3)$$

$$P(B.C) = P(B) \cdot P(B,C) = P(C) \cdot P(C,B) \qquad (4)$$

$$P(C) = P(B) \cdot P(B,C) + P(\bar{B}) \cdot P(\bar{B},C) \qquad (5)$$

$$P(B \lor C, D) = \frac{P(B) \cdot P(B,D) + P(C) \cdot P(C,D) - P(B) \cdot P(B,C) \cdot P(B.C,D)}{P(B) + P(C) - P(B) \cdot P(B,C)} \qquad (6)$$

and for exclusive events

$$P(B \lor C, D) = \frac{P(B) \cdot P(B,D) + P(C) \cdot P(C,D)}{P(B) + P(C)} \qquad (7)$$


These special forms follow from the general forms (7, § 13), (8, § 20), (3, § 14), (2, § 19), (16, § 22), (3, § 22), when $A \lor \bar{A}$ is substituted for A. But it is not possible conversely to derive the general forms from the special forms; the latter hold only for the universal class as reference class, whereas the former hold for all reference classes. For the treatment of the relative probabilities occurring in the above formulas, therefore, formulas in terms of the general reference class A must be used.¹

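If two classes B and C are considered, the complete system of probabilities pertaining to them is given by the probabilities

$$\begin{array}{cccc}
P(B) & P(\bar{B}) & P(C) & P(\bar{C}) \\
P(B,C) & P(\bar{B},C) & P(C,B) & P(\bar{C},B) \\
P(B,\bar{C}) & P(\bar{B},\bar{C}) & P(C,\bar{B}) & P(\bar{C},\bar{B})
\end{array} \qquad (8)$$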


The probabilities of the second and third line will be called mutual probabilities. The twelve values (8) are determined by the three fundamental probabilities

$$P(B) \qquad P(C) \qquad P(B,C) \qquad (9)$$

which are the analogues of the three fundamental probabilities P(A,B), P(A,C), P(A.B,C), introduced in (7', § 21). The computation is made by the use of the relations (2)-(5); among these, (4) supplies the value P(C,B); (5), the values with negated terms in the reference class.

¹ If formulas containing the general reference class A are to be derivable from the corresponding formulas written in the notation by absolute probabilities, a particular rule of substitution must be introduced; see rule a, § 82.

It was mentioned above that the choice of these values as fundamental probabilities is a matter of convention, and that three other independent values could be used for the same purpose. From this point of view it is of some interest to select three values from the second line in (8) as fundamental probabilities. This line contains the two affirmative terms P(B,C) and P(C,B), which contain no negation signs, and the two terms of negative reference $P(\bar{B},C)$ and $P(\bar{C},B)$, which contain negation signs for reference classes. These four probabilities are not independent, but are connected by the relation

$$\frac{P(B,C) \cdot P(\bar{C},B)}{P(C,B) \cdot P(\bar{B},C)} = \frac{1 - P(B,C) - P(\bar{C},B)}{1 - P(C,B) - P(\bar{B},C)} \qquad (10)$$

This relation is derived as follows. We introduce the abbreviations

$$P(B) = b \qquad P(C) = c \qquad P(B,C) = c_1 \qquad P(\bar{B},C) = c_2 \qquad P(C,B) = b_1 \qquad P(\bar{C},B) = b_2 \qquad (11)$$


Applying (4) to the three forms $P(B.C)$, $P(\bar{B}.C)$, $P(B.\bar{C})$, we construct the three relations

$$b c_1 = c b_1 \qquad c(1 - b_1) = c_2(1 - b) \qquad b(1 - c_1) = b_2(1 - c) \qquad (12)$$

Solving the first two relations for b and c, we have

$$b = \frac{b_1 c_2}{c_1(1 - b_1) + b_1 c_2} \qquad c = \frac{c_1 c_2}{c_1(1 - b_1) + b_1 c_2} \qquad (13)$$

Inserting these values in the third relation (12), we find

$$\frac{c_1 b_2}{b_1 c_2} = \frac{b_2 + c_1 - 1}{c_2 + b_1 - 1} \qquad (14)$$

which is the relation (10).

Equations (13) determine the absolute probabilities as functions of the three mutual probabilities $P(B,C)$, $P(C,B)$, $P(\bar{B},C)$, and can be written in the form

$$P(B) = \frac{P(C,B) \cdot P(\bar{B},C)}{P(B,C)[1 - P(C,B)] + P(C,B) \cdot P(\bar{B},C)} \qquad P(C) = \frac{P(B,C) \cdot P(\bar{B},C)}{P(B,C)[1 - P(C,B)] + P(C,B) \cdot P(\bar{B},C)} \qquad (15)$$


These results show that all the probabilities (8) are determined by the three mutual probabilities $P(B,C)$, $P(C,B)$, $P(\bar{B},C)$. Exception is to be made for the case that the denominator of (15) (which represents the determinant of the corresponding set of linear equations) vanishes. This is the case, in particular, for exclusive classes B and C, that is, for P(B,C) = P(C,B) = 0; (15) then gives the indeterminate form 0/0.
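
Formula (15) and the relation (10) can be tried on numbers. The following Python sketch starts from hypothetical values of the three fundamental probabilities (9), computes the mutual probabilities by means of (4) and (5), and then recovers the absolute probabilities from (15); the function name `absolute_from_mutual` and the abbreviations follow (11) and are assumptions of the example.

```python
from fractions import Fraction as F

def absolute_from_mutual(c1, b1, c2):
    """Formula (15) with c1 = P(B,C), b1 = P(C,B), c2 = P(~B,C); returns (P(B), P(C))."""
    den = c1 * (1 - b1) + b1 * c2
    if den == 0:
        # Degenerate case, for instance exclusive classes: (15) gives the form 0/0.
        raise ValueError("denominator of (15) vanishes")
    return b1 * c2 / den, c1 * c2 / den

# Hypothetical starting values: P(B) = 2/5, P(C) = 1/2, P(B,C) = 3/4.
b, c, c1 = F(2, 5), F(1, 2), F(3, 4)
b1 = b * c1 / c               # P(C,B), from (4)
c2 = (c - b * c1) / (1 - b)   # P(~B,C), from (5)
b2 = (b - b * c1) / (1 - c)   # P(~C,B), from (5) with B and C interchanged

recovered = absolute_from_mutual(c1, b1, c2)
lhs_10 = (c1 * b2) / (b1 * c2)           # left side of (10)
rhs_10 = (1 - c1 - b2) / (1 - b1 - c2)   # right side of (10)
print(recovered, lhs_10, rhs_10)
```

For exclusive classes, where c1 = b1 = 0, the function raises an error instead of returning a value, in agreement with the exception stated for (15).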

It is possible to construct a formula that is not subject to this degeneration when the probability $P(\bar{C},B)$ is included in the arguments. For P(B,C) = P(C,B) = 0, formula (10) supplies the form 0/0, and the four values of the second line of (8) are no longer connected by a restrictive condition; thus the last term of this line can be added as an independent parameter. For the derivation we use (5), first with B and C interchanged, then in the form given, thus arriving at the two equations

$$b = c\,b_1 + (1 - c)\,b_2 \qquad c = b\,c_1 + (1 - b)\,c_2 \qquad (16)$$

which, solved for b and c, give the results

$$b = \frac{c_2 b_1 + b_2(1 - c_2)}{1 - (b_1 - b_2)(c_1 - c_2)} \qquad c = \frac{b_2 c_1 + c_2(1 - b_2)}{1 - (b_1 - b_2)(c_1 - c_2)} \qquad (17)$$

Introducing the exclusion condition $b_1 = c_1 = 0$, we find

$$b = \frac{b_2(1 - c_2)}{1 - b_2 c_2} \qquad c = \frac{c_2(1 - b_2)}{1 - b_2 c_2} \qquad (18)$$

These formulas can be written

$$P(B) = \frac{P(\bar{C},B)[1 - P(\bar{B},C)]}{1 - P(\bar{B},C) \cdot P(\bar{C},B)} \qquad P(C) = \frac{P(\bar{B},C)[1 - P(\bar{C},B)]}{1 - P(\bar{B},C) \cdot P(\bar{C},B)} \qquad (19)$$

They determine the absolute probabilities in terms of the mutual probabilities for the exclusive classes B and C.

The two affirmative mutual probabilities P(B,C) and P(C,B) are connected by the relation

$$\frac{P(B,C)}{P(C,B)} = \frac{P(C)}{P(B)} \qquad (20)$$

which follows from (4). Corresponding relations hold for negated mutual probabilities; they follow from (20) by the substitution of $\bar{B}$ for B and so on.

In many applications the two absolute probabilities P(B) and P(C) are unknown, and only the two affirmative mutual probabilities P(B,C) and P(C,B) are given. These values are subject to no restrictions other than that their values be between 0 and 1. The two probabilities are sometimes combined in a mutual probability implication, which is written in the implicative notation

$$(B \;\underset{p}{\ni}\underset{q}{\in}\; C) \qquad (21)$$

This is equivalent to the conjunction

$$(B \underset{p}{\ni} C) \,.\, (C \underset{q}{\ni} B) \qquad (22)$$

As explained, the two values p and q are not subject to any connecting condition. It is even possible that p = 1 and q = 0 without B or C being empty. The corresponding general implications $(B \supset C)$ and $(C \supset \bar{B})$ are compatible only if B is empty, since we can derive from them, by the transitivity of the implication, the relation $(B \supset \bar{B})$. There is no such consequence for the probability implications because the probability 1 is not equivalent to certainty. Thus P(C,B) = 0 does not exclude the possibility that C is sometimes accompanied by B.

If the two mutual probabilities are given, the values of the absolute probabilities and those of the probabilities of negative reference are not determined. Only the ratio of the absolute probabilities is determined, according to (20). But it is possible to compute some other probabilities. First, we can replace the relation (20), which includes absolute probabilities, by a corresponding relation for relative probabilities. For this derivation, the relations (2)-(7) are not sufficient, and we must return to the notation in terms of the general reference class A. We apply (4, § 21), substitute for A the disjunction B ∨ C, and use the tautological equivalences

$$([B \lor C].B \equiv B) \qquad ([B \lor C].C \equiv C) \qquad (23)$$

We thus arrive at the formula

$$\frac{P(B,C)}{P(C,B)} = \frac{P(B \lor C, C)}{P(B \lor C, B)} \qquad (24)$$

The expressions P(B V C,B) and P(B V C,C) may be called disjunctive weights; 
they determine the weight with which either of the terms B or C occurs in 
their mutual disjunction. Formula (24) states that the ratio of the mutual 
probabilities is equal to the ratio of the disjunctive weights. 

There is a further relation, which makes it possible, in combination with 
(24), to determine the disjunctive weights in terms of the affirmative mutual 
probabilities. We have, with the general rule of addition in the A-notation, 

$$P(B \lor C, B \lor C) = 1 = P(B \lor C, B) + P(B \lor C, C) - P(B \lor C, B.C) \qquad (25)$$

The last term is transformed with (23) into

$$P(B \lor C, B.C) = P(B \lor C, B) \cdot P([B \lor C].B, C) = P(B \lor C, B) \cdot P(B,C) \qquad (26)$$

Introducing this result in (25) and substituting for P(B ∨ C,C) the value resulting from (24), we arrive at an equation, which, solved for P(B ∨ C,B), gives the result

$$P(B \lor C, B) = \frac{P(C,B)}{P(B,C) + P(C,B) - P(B,C) \cdot P(C,B)} \qquad (27)$$


This relation will be called the general rule of the disjunctive weight. It determines the disjunctive weight as a function of the affirmative mutual probabilities.

It is easily seen that the disjunctive weight of C is given by a similar expression, resulting from (27) when P(B,C) is put for P(C,B) in the numerator. The probability of the product results from (26) in the form

$$P(B \lor C, B.C) = \frac{P(B,C) \cdot P(C,B)}{P(B,C) + P(C,B) - P(B,C) \cdot P(C,B)} \qquad (28)$$


As for (15), a qualification must be added. Formulas (27)-(28) depend on the condition that at least one of the two mutual probabilities is > 0. It follows that for exclusive classes the disjunctive weights are not determined by the affirmative mutual probabilities.
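
The general rule of the disjunctive weight lends itself to a brief numerical sketch in Python. The two mutual probabilities used below are hypothetical values, and the function name `disjunctive_weights` is an assumption of the example.

```python
from fractions import Fraction as F

def disjunctive_weights(c1, b1):
    """Formulas (27) and (28) with c1 = P(B,C) and b1 = P(C,B).

    Returns P(B v C,B), P(B v C,C), P(B v C,B.C).
    """
    den = c1 + b1 - c1 * b1
    if den == 0:
        # Both mutual probabilities are 0: the weights are not determined.
        raise ValueError("formulas (27)-(28) require a mutual probability > 0")
    return b1 / den, c1 / den, c1 * b1 / den

# Hypothetical mutual probabilities P(B,C) = 3/4 and P(C,B) = 3/5.
w_b, w_c, w_bc = disjunctive_weights(F(3, 4), F(3, 5))
print(w_b, w_c, w_bc)
```

The three returned values satisfy (25), and the ratio of the two weights equals the ratio of the mutual probabilities as stated in (24).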

As before, a computation for exclusive classes is made possible by the use of mutual probabilities of negative reference. From (7) we derive, substituting B for D and putting P(C,B) = 0 because of the exclusion condition,

$$P(B \lor C, B) = \frac{P(B)}{P(B) + P(C)} \qquad (29)$$

With the application of (19) we find

$$P(B \lor C, B) = \frac{P(\bar{C},B)[1 - P(\bar{B},C)]}{P(\bar{B},C) + P(\bar{C},B) - 2P(\bar{B},C) \cdot P(\bar{C},B)} \qquad (30)$$


This formula, which holds only for exclusive disjunctions, will be called the special rule of the disjunctive weight. Since the mutual probabilities of negative reference used in the formula are sufficient to determine the absolute probabilities, a knowledge of the disjunctive weights, for exclusive disjunctions, is inseparable from a knowledge of the absolute probabilities.
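
The exclusive case can be sketched in the same manner. The following Python fragment assumes two hypothetical exclusive classes, derives the two mutual probabilities of negative reference from them, and then applies (19) and (30); the function names are assumptions of the example.

```python
from fractions import Fraction as F

def exclusive_absolute(b2, c2):
    """Formula (19) with b2 = P(~C,B), c2 = P(~B,C), for exclusive B and C."""
    den = 1 - b2 * c2
    return b2 * (1 - c2) / den, c2 * (1 - b2) / den

def exclusive_weight_of_b(b2, c2):
    """Formula (30): the disjunctive weight P(B v C,B) for exclusive B and C."""
    return b2 * (1 - c2) / (b2 + c2 - 2 * b2 * c2)

# Hypothetical exclusive classes with P(B) = 1/5 and P(C) = 3/10.
p_b, p_c = F(1, 5), F(3, 10)
b2 = p_b / (1 - p_c)   # P(~C,B): all of B lies in ~C
c2 = p_c / (1 - p_b)   # P(~B,C): all of C lies in ~B

recovered = exclusive_absolute(b2, c2)
weight_b = exclusive_weight_of_b(b2, c2)
print(recovered, weight_b)
```

The weight obtained from (30) agrees with the value P(B)/[P(B) + P(C)] of formula (29), which illustrates that the disjunctive weight and the absolute probabilities are given together in this case.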

We turn now to probability relations between three classes $B_1, B_2, B_3$. The complete probability system, written only for affirmative terms, is given here by the probabilities

$$\begin{array}{ccc}
P(B_1) & P(B_2) & P(B_3) \\
P(B_1,B_2) & P(B_2,B_3) & P(B_3,B_1) \\
P(B_2,B_1) & P(B_3,B_2) & P(B_1,B_3) \\
P(B_1.B_2,B_3) & P(B_2.B_3,B_1) & P(B_3.B_1,B_2)
\end{array} \qquad (31)$$

The other forms result by substitution of $\bar{B}_1$ and so on, in these expressions. The probabilities of the last line will be called compound mutual probabilities. Those of the second and third lines are then called simple mutual probabilities. Note that the values (31) are not subject to restrictive conditions: it is not required that the three classes be exclusive or independent or that they form a complete disjunction.

For any two simple mutual probabilities, formula (4) leads to the relation

$$\frac{P(B_i,B_k)}{P(B_k,B_i)} = \frac{P(B_k)}{P(B_i)} \qquad (32)$$

The probabilities of the third line of (31) are thus determined by those of the first and second lines. Probabilities having $\bar{B}_i$ or $\bar{B}_k$ in the reference class are derivable from the affirmative terms by means of (5).

The three compound mutual probabilities are connected by the relations, following from the rule of the product,

$$\frac{P(B_i.B_k,B_m)}{P(B_i.B_m,B_k)} = \frac{P(B_i,B_m)}{P(B_i,B_k)} \qquad (33)$$

The three relations resulting for m = 1, k = 2; m = 2, k = 3; and m = 3, k = 1, are not independent, because the last is easily seen to be a consequence of the other two. Thus (33) represents two independent relations. If one of the affirmative compound mutual probabilities is given, the other two are thus determined. Probabilities with terms $\bar{B}_i$ in the reference class are computed from the affirmative terms by means of the relations (5) and (10, § 19).

The complete probability system for three classes is thus determined by 
the six values of the first two lines of (31) and, besides, one of the compound 
values of the last line of (31), that is, by seven independent probability values. 

If the absolute probabilities $P(B_1)$, $P(B_2)$, $P(B_3)$ are unknown, the six values of the simple mutual probabilities in the second and third lines of (31) cannot be assumed arbitrarily, but are connected by the relation

$$\frac{P(B_1,B_2)}{P(B_2,B_1)} \cdot \frac{P(B_2,B_3)}{P(B_3,B_2)} \cdot \frac{P(B_3,B_1)}{P(B_1,B_3)} = 1 \qquad (34)$$

which follows by the use of (32) and can be written in the form²

$$P(B_1,B_2) \cdot P(B_2,B_3) \cdot P(B_3,B_1) = P(B_1,B_3) \cdot P(B_3,B_2) \cdot P(B_2,B_1) \qquad (35)$$

Only five of these values, therefore, are independent. Formula (35) will be called the rule of the triangle. It states that in a triangle $B_1 B_2 B_3$ the product of the three mutual probabilities is the same, whether we go clockwise or counterclockwise around the triangle. For two events there is no such dependence of mutual probabilities, because in this case there is only one direction for the “round trip”. The distinction of two such directions begins with three events.

² This relation was pointed out by Norman Dalkey, “The Plurality of Language Structures,” doctoral dissertation, University of California at Los Angeles, 1942.

The relation (35) has a simple explanation in the frequency interpretation; it represents the identity

$$\frac{N^n(B_1.B_2)}{N^n(B_1)} \cdot \frac{N^n(B_2.B_3)}{N^n(B_2)} \cdot \frac{N^n(B_3.B_1)}{N^n(B_3)} = \frac{N^n(B_1.B_3)}{N^n(B_1)} \cdot \frac{N^n(B_3.B_2)}{N^n(B_3)} \cdot \frac{N^n(B_2.B_1)}{N^n(B_2)} \qquad (36)$$
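
The frequency identity can be tried on concrete counts. In the following Python sketch the numbers of elements in the classes and in their pairwise products are hypothetical; the function name `mutual` is an assumption of the example. The sketch forms both products of (35) and also computes the sixth mutual probability from the other five.

```python
from fractions import Fraction as F

# Hypothetical counts among the first n elements of the sequence.
n_single = {1: 40, 2: 50, 3: 30}                 # N(B_i)
n_pair = {(1, 2): 20, (2, 3): 15, (1, 3): 12}    # N(B_i.B_k), symmetric in i and k

def mutual(i, k):
    """Frequency value of the mutual probability P(B_i,B_k) = N(B_i.B_k) / N(B_i)."""
    return F(n_pair[tuple(sorted((i, k)))], n_single[i])

clockwise = mutual(1, 2) * mutual(2, 3) * mutual(3, 1)
counterclockwise = mutual(1, 3) * mutual(3, 2) * mutual(2, 1)

# The sixth value P(B_1,B_3), computed from the other five by means of (35).
sixth = clockwise / (mutual(3, 2) * mutual(2, 1))
print(clockwise, counterclockwise, sixth)
```

Both products contain the same three numerators and the same three denominators, only in a different order, which is the content of (36).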

The rule of the triangle (35) is automatically satisfied if the absolute probabilities in combination with the second line of (31) are used for the determination of the values of the third line. But if the absolute probabilities are not used and, instead, five of the simple mutual probabilities are assumed arbitrarily, they are subject to the numerical restriction

$$P(B_1,B_2) \cdot P(B_2,B_3) \cdot P(B_3,B_1) \le P(B_3,B_2) \cdot P(B_2,B_1) \qquad (37)$$

which formulates the condition $P(B_1,B_3) \le 1$ for computation of this probability from (35). This inequality is to be added to the inequalities (15, § 19).

The following special conditions can be derived. If $P(B_1,B_2) = 1$ and $P(B_2,B_3) = 1$, it follows that $P(B_1,B_3) = 1$ if $P(B_2,B_1) > 0$. This transitivity is shown by the considerations added to (7, § 19). Another rule of transitivity is as follows: if transitivity holds in one direction of the triangle, it also holds in the other. This theorem, which applies, too, when the probabilities are < 1, is derivable from (35), because when we put there

$$P(B_1,B_3) = P(B_1,B_2) \cdot P(B_2,B_3)$$

we have

$$P(B_3,B_1) = P(B_3,B_2) \cdot P(B_2,B_1)$$

The values (37) determine the ratios of the absolute probabilities, according to (32). If a further condition is added, for instance, that the disjunction of the three classes be complete, the absolute probabilities are determinable. Instead of such a condition for the absolute probabilities, it is sufficient to give one simple mutual probability of negative reference. The computation of the absolute probabilities then follows the methods developed for two classes.

If, besides the values (37), one compound probability and one simple mutual probability of negative reference are given, all the other probabilities can be computed by the methods developed for two classes.

For three classes, the problem of a disjunctive reference class offers particular interest. The problem will first be treated for nonexclusive reference classes $B_1$ and $B_2$, for which case it can be solved in terms of affirmative mutual probabilities. When we insert the values (15) in (6), the denominator of (15) drops out and the term $P(\bar{B},C)$, which occurs in every term, can be canceled. Putting $B_1$, $B_2$, $B_3$, respectively, for B, C, D, we arrive at the formula

$$P(B_1 \lor B_2, B_3) = \frac{P(B_1,B_2) \cdot P(B_2,B_3) + P(B_2,B_1) \cdot P(B_1,B_3) - P(B_1,B_2) \cdot P(B_2,B_1) \cdot P(B_1.B_2,B_3)}{P(B_1,B_2) + P(B_2,B_1) - P(B_1,B_2) \cdot P(B_2,B_1)} \qquad (38)$$

The formula differs from the general rule of reduction, in the forms (16, § 22) or (6), in that it includes neither absolute probabilities nor a term A common to all reference classes. Instead, the terms $B_1$ and $B_2$ of the disjunction are distributed into the first terms of the expressions on the right; we therefore call (38) the general rule of distributive reference. The occurrence of the term $P(B_1.B_2,B_3)$ shows that the solution requires one compound mutual probability; but all the terms are affirmative mutual probabilities.
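
Formula (38) can be compared with a direct count. The following Python sketch uses hypothetical weights for the eight cases of three classes; only affirmative mutual probabilities and one compound mutual probability enter the formula, as the text states. The helper name `prob` is an assumption of the example.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical weights of the eight cases (B_1, B_2, B_3).
cases = list(product([True, False], repeat=3))
weights = dict(zip(cases, [F(w, 36) for w in (3, 5, 2, 6, 4, 7, 1, 8)]))

def prob(ref, attr):
    """P(ref, attr): share of the weight of `ref` that also satisfies `attr`."""
    total = sum(w for case, w in weights.items() if ref(*case))
    hit = sum(w for case, w in weights.items() if ref(*case) and attr(*case))
    return hit / total

in_1 = lambda x1, x2, x3: x1
in_2 = lambda x1, x2, x3: x2
in_3 = lambda x1, x2, x3: x3

p12, p21 = prob(in_1, in_2), prob(in_2, in_1)     # P(B_1,B_2), P(B_2,B_1)
p13, p23 = prob(in_1, in_3), prob(in_2, in_3)     # P(B_1,B_3), P(B_2,B_3)
p12_3 = prob(lambda x1, x2, x3: x1 and x2, in_3)  # compound value P(B_1.B_2,B_3)

# Formula (38), the general rule of distributive reference.
by_rule = (p12 * p23 + p21 * p13 - p12 * p21 * p12_3) / (p12 + p21 - p12 * p21)
# Direct evaluation of P(B_1 v B_2,B_3) from the weights.
direct = prob(lambda x1, x2, x3: x1 or x2, in_3)
print(direct, by_rule)
```

No absolute probability is computed on the way to `by_rule`; the weights serve only to supply consistent mutual probabilities and the value used for comparison.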

For exclusive classes, again, a different solution is necessary, because, for $P(B_1,B_2) = P(B_2,B_1) = 0$, formula (38) gives the form 0/0. As before, the problem is solved by the use of probabilities of negative reference. Starting with (7), we insert the values of P(B) and P(C) from (19); we then substitute $B_1$, $B_2$, $B_3$, respectively, for B, C, D, and arrive at the result

$$P(B_1 \lor B_2, B_3) = \frac{P(B_1,B_3) \cdot P(\bar{B}_2,B_1) \cdot [1 - P(\bar{B}_1,B_2)] + P(B_2,B_3) \cdot P(\bar{B}_1,B_2) \cdot [1 - P(\bar{B}_2,B_1)]}{P(\bar{B}_1,B_2) + P(\bar{B}_2,B_1) - 2P(\bar{B}_1,B_2) \cdot P(\bar{B}_2,B_1)} \qquad (39)$$

This is the special rule of distributed reference, which holds only for exclusive disjunctions. It does not require the use of compound mutual probabilities, but presupposes terms of negative reference. Note that, in contradistinction to previous theorems to which similar names were assigned (§§ 14, 20, 22), the two special rules (30) and (39) do not follow from the two general rules (27) and (38) as special cases, but require separate derivations.

The considerations can be extended to n classes $B_1 \ldots B_n$. For every subset of m classes there exist compound mutual probabilities of the form $P(B_{k_1} \ldots B_{k_{m-1}}, B_{k_m})$. They are connected with those of the next lower subset by the relations

$$\frac{P(B_{k_1} \ldots B_{k_{m-2}}.B_{k_{m-1}}, B_{k_m})}{P(B_{k_1} \ldots B_{k_{m-2}}.B_{k_m}, B_{k_{m-1}})} = \frac{P(B_{k_1} \ldots B_{k_{m-2}}, B_{k_m})}{P(B_{k_1} \ldots B_{k_{m-2}}, B_{k_{m-1}})} \qquad (40)$$

which express the rule of the product. Given one probability of the subset, all the others are thus determined in terms of those of the next lower subset. Probabilities having a term $\bar{B}_{k_r}$ in the reference class are determined by the use of the rule of elimination (10, § 19).

The total number μ of independent probabilities determining the complete probability system for n classes is computed as follows. First, the n absolute probabilities $P(B_k)$ can be given independently. Second, of each subset of m classes one probability must be given. This includes the case m = 2, for which we have the simple mutual probabilities; for every combination $P(B_i,B_k)$, the converse probability follows from (32) in terms of the absolute probabilities. The relation (32) is not restricted to three classes and is a special case of (40) resulting for m = 2. The number of subsets of m classes among n classes being $\binom{n}{m}$, we find, since $n = \binom{n}{1}$,

$$\mu = \sum_{m=1}^{n} \binom{n}{m} = 2^n - 1 \qquad (41)$$

using the familiar theorem for binomial coefficients

$$\sum_{m=0}^{n} \binom{n}{m} = 2^n \qquad (42)$$

For n = 2 we have μ = 3, according to (9). For n = 3 we have μ = 7, in correspondence with the above result.

The number ν of affirmative probabilities can be found as follows. For every subset of m classes, there are m affirmative probabilities, which result when, one after another, each of the classes is chosen as attribute class. This is true, too, for m = 2 and m = 1. There being $\binom{n}{m}$ subsets of m terms, we have

$$\nu = \sum_{m=1}^{n} m \binom{n}{m} = n \sum_{m=1}^{n} \frac{(n-1)!}{(m-1)!\,(n-m)!} = n \sum_{m=1}^{n} \frac{(n-1)!}{(m-1)!\,[(n-1) - (m-1)]!} = n \sum_{m=1}^{n} \binom{n-1}{m-1} = n \cdot 2^{n-1} \qquad (43)$$

For n = 1 we have ν = 1; for n = 2, ν = 4; for n = 3, ν = 12, in correspondence with (9) and (31).

The number ρ of probabilities of the complete system, all being of the form
occurring in (40), but including both affirmative and negated terms, is computed
as follows. Each affirmative probability of m terms contributes $2^m$ to
the total, since each term can be written once without, and once with, a
negation sign. So we have, using the preceding transformation of the sum,

$$\begin{aligned}
\rho &= \sum_{m=1}^{n} m \binom{n}{m} 2^m = 2n \sum_{m=1}^{n} \binom{n-1}{m-1} 2^{m-1} \\
&= 2n(2+1)^{n-1} = 2n \cdot 3^{n-1}
\end{aligned} \tag{44}$$

For the transition to the second line we use the binomial theorem

$$(p+q)^n = \sum_{m=0}^{n} \binom{n}{m} p^m q^{n-m} \tag{45}$$

choosing p = 2, q = 1, and putting n − 1 for n and m − 1 for m. For n = 1
we have ρ = 2; for n = 2, ρ = 12 (see 8); for n = 3, ρ = 54.
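
The three counts derived above (independent probabilities, affirmative probabilities, and probabilities of the complete system) can be checked by direct summation. The following is a minimal sketch in Python; the function names are illustrative and do not occur in the text. Each function evaluates the sum over subset sizes m, and the loop compares it with the closed forms of (41), (43), and (44).

```python
from math import comb

def independent_count(n: int) -> int:
    """One probability per non-empty subset of the n classes, as in (41)."""
    return sum(comb(n, m) for m in range(1, n + 1))

def affirmative_count(n: int) -> int:
    """m affirmative probabilities for every subset of m classes, as in (43)."""
    return sum(m * comb(n, m) for m in range(1, n + 1))

def complete_count(n: int) -> int:
    """Each affirmative probability of m terms contributes 2**m, as in (44)."""
    return sum(m * comb(n, m) * 2**m for m in range(1, n + 1))

# Compare the direct sums with the closed forms for a range of n.
for n in range(1, 8):
    assert independent_count(n) == 2**n - 1
    assert affirmative_count(n) == n * 2 ** (n - 1)
    assert complete_count(n) == 2 * n * 3 ** (n - 1)

print([(n, independent_count(n), affirmative_count(n), complete_count(n))
       for n in (1, 2, 3)])
# [(1, 1, 1, 2), (2, 3, 4, 12), (3, 7, 12, 54)]
```

The printed triples reproduce the values quoted in the text for n = 1, 2, 3.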

There are many applications for the relations developed for three classes.
Let $B_1$ be a symptom of illness; $B_2$, a certain disease; $B_3$, the case of death.
The simple mutual probabilities may be known from statistics; the relation
(35) shows that only five are to be ascertained, the sixth being determinable.
Furthermore, one of the compound probabilities must be ascertained, for
instance, $P(B_1.B_2, B_3)$. When these values are known, all statistical questions
referring to the three classes are answerable except those referring to
absolute probabilities or probabilities of negative reference. A psychological
application obtains when $B_1$ means a certain stimulus; $B_2$, a perception; $B_3$,
a certain reaction of a person.


§ 25. Remarks Concerning the Mathematical Formalization 
of the Probability Calculus 

Having carried through, to a large extent, the formalization of the calculus 
of probability, we are now free to discuss this procedure from a logical viewpoint.
The “logification” by which this construction of the calculus was introduced
has, in the meantime, been transformed into a “mathematization”, a
notation in which the logical operations are restricted to the inner part of 
the P-symbols. The resulting complexes of the P-symbols, into which these 
symbols enter as units, have the character of mathematical equations. Thus 
the probability calculus acquires a form that is convenient for the purpose of 
carrying out calculations. 

This manner of writing—the mathematical notation—has the disadvantage 
that it cannot express certain relations of a nonmathematical kind that hold 
within the probability calculus. There are three different forms of such 
relations: 

1. The dependence of a mathematical equation on the validity of another
mathematical equation, that is, the implication between equations. An example
is given by the assertion that (4, § 14) is the condition of validity for
(5, § 14).

2. The dependence of a mathematical equation on a nonmathematical
condition. Of this kind is the condition $(A \supset B)$ in axiom II,1 or the condition
of exclusion in axiom III.

3. Logical properties of the quantities occurring, such as are expressed in
the statement of univocality formulated in axiom I.

The first case is irrelevant because the existence of a logical implication
between equations is easily expressible by some connecting words in the 
context. This is the method usually applied in mathematics. The second 
case is serious because here the condition on which the validity of a mathe¬ 
matical equation depends is not expressible in mathematical notation. The 
third case is irrelevant again. It concerns only the assertion of univocality; 
this assertion, as is usual in mathematics, may be added in words. 

It will now be shown that the second difficulty can be eliminated by the 
use of a method that translates the logical condition into a mathematical 
condition. The method may be illustrated by reference to the general theorem
of addition. It was seen, in § 20, that the condition of exclusion, written for 
this axiom in the logical notation, could be replaced by the condition that 
the corresponding probability becomes 0. Since this assumption, according to 
(9, § 13), states less than the strict condition of exclusion, a certain generaliza¬ 
tion of the special theorem of addition has thus been constructed. It will now 
be shown that the same procedure is feasible in some other places, so that, 
by its use, relations of the form 2 can be reduced to those of the form 1. We 
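are concerned here with the following theorems written in the implicational
notation:

$$(A \supset B) \supset [(A \underset{p}{\ni} C) \equiv (A.B \underset{p}{\ni} C)] \tag{1}$$

$$(A \supset B) \supset (\exists p)(A.C \underset{p}{\ni} B).(p = 1) \tag{2}$$

$$(A \supset B) \supset [(A \underset{p}{\ni} C) \equiv (A \underset{p}{\ni} B.C)] \tag{3}$$

$$(C \supset B) \supset [(A \underset{p}{\ni} C) \equiv (A \underset{p}{\ni} B.C)] \tag{4}$$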

The proof of the theorems is easily given. Let us prove immediately their
generalization for the cases $P(A,B) = 1$ and $P(C,B) = 1$, respectively. From
this result, of course, by the help of II,1, theorems (1)-(4) follow.

Instead of (1) we obtain: if $P(A,B) = 1$, then

$$P(A,C) = P(A.B,C) \tag{5}$$

This follows from the elimination theorem (2, § 19) because $P(A,\bar{B}) = 0$.
Formula (5) states that, if $P(A,B) = 1$, B and any C are mutually independent
with respect to A.

Instead of (2) we obtain: if $P(A,B) = 1$ and $P(A,C) > 0$, we have for any C

$$P(A.C,B) = 1 \tag{6}$$

The proof follows from the product rule (6, § 21) by the use of (5). If
$P(A,C) = 0$, (6) is not derivable; in this case the value of $P(A.C,B)$ cannot be
determined from $P(A,B)$, though it is possible that a determinate value $P(A.C,B)$
exists.

Instead of (3) we obtain: if $P(A,B) = 1$, then

$$P(A,C) = P(A,B.C) \tag{7}$$

The proof is given by the multiplication theorem (3, § 14) with the help of
(6). This formula is also valid for the case $P(A,C) = 0$, since it then follows
directly from the multiplication theorem without the use of (6).

Instead of (4) we obtain: if $P(C,B) = 1$ and $P(C,A) > 0$, then

$$P(A,C) = P(A,B.C) \tag{8}$$

The proof is given by the multiplication theorem, since, for the assumptions
made, $P(A.C,B) = 1$, according to (6), if A and C are interchanged in (6).
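
The four relations can be illustrated on a small model. The following sketch, in Python, assumes the frequency interpretation over finite classes: a class is a finite set and P(A,B) is the relative frequency of B within A. It covers only the case in which the probability 1 arises from strict class inclusion, that is, the case of the logical theorems (1)-(4). The sets and the helper `prob` are hypothetical and chosen for illustration.

```python
from fractions import Fraction

def prob(reference, attribute):
    """Relative frequency of `attribute` within a non-empty finite `reference` class."""
    return Fraction(len(reference & attribute), len(reference))

# Hypothetical classes; every A is a B, so P(A,B) = 1.
A = set(range(0, 12))
B = set(range(0, 20))
C = set(range(6, 30, 2))      # overlaps A, so P(A,C) > 0

# For (8): a class C8 with P(C8,B) = 1, and a reference class A8 not contained in B.
C8 = {6, 8, 10, 14}
A8 = set(range(8, 25))

checks = {
    "(5)": prob(A, C) == prob(A & B, C),
    "(6)": prob(A & C, B) == 1,
    "(7)": prob(A, C) == prob(A, B & C),
    "(8)": prob(C8, B) == 1 and prob(A8, C8) == prob(A8, B & C8),
}
print(prob(A, C), checks)
```

In this model all four checks come out true, and P(A,C) is 1/4. A probability 1 that does not rest on inclusion cannot be represented with finite sets, so the sketch does not test the restrictions P(A,C) > 0 and P(C,A) > 0 in their full generality.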

With these proofs the mathematical formalization is carried through for
the four theorems. 1 It will now be shown that in axiom II,1, too, the condition
$(A \supset B)$ can be formally eliminated.

In this case we make use of the equivalence

$$([A \supset B] \equiv [B \equiv A \lor B]) \tag{9}$$

The formula is proved by solving the right side according to (7b, § 4), applying
(4e, § 4) and (4b, § 4) and, finally, transforming the left side by (6a, § 4).
Formula (9) is a tautology of the class calculus, that is, the expression inside
the parentheses represents the universal class. The formula can be transcribed
into the form

$$(A \supset B) = (B \equiv A \lor B) \tag{9'}$$


This formula means that if A is a subclass of B, the joint class $A \lor B$ is identical
with B.

Because of (9), axiom II,1 is equivalent to the expression

$$P(A, A \lor B) = 1 \tag{10}$$

For we have, on account of (9),

$$(A \supset B) \supset [P(A,B) = P(A, A \lor B)] \tag{11}$$

Therefore, II,1 follows from (10). That, conversely, (10) follows from II,1
can be shown by the use of (8a, § 4).

1 The restrictions $P(A,C) > 0$ and $P(C,A) > 0$ added, respectively, to (6) and (8) are
not required for the corresponding theorems (2) and (4). This is due to the fact that the
logical implication represents a stronger assumption than the probability 1.

As a formula that cannot be formalized mathematically, there remains 
only axiom I, the axiom of univocality, apart from implications of the form 1.
All other expressions can be formalized, and we may ask whether we can
omit axiom I and construct a mathematical axiom system of the calculus of
probability. By this term is understood a system in which the logical opera¬ 
tions are restricted to the inner part of the P-symbols, whereas the symbols 
themselves enter into relations having the form of mathematical equations. 
Such an axiom system can be constructed; the condition of univocality is 
then added in words. 

In order to set up this axiom system, we introduce the following changes
from the axiom system written in the implicational notation. We replace
II,1 by (10). Furthermore, we replace the addition theorem III by the general
theorem of addition (8, § 20), so that we can free ourselves from the condition
of exclusion. This requires, in the group of axioms of normalization, a further
axiom, α,2, which defines the probability 0 in a way similar to that in which
α,1 defines the probability 1. We thus obtain the following mathematical
axiom system of the calculus of probability:

α) Axioms of normalization

1. $P(A, A \lor B) = 1$

2. $P(A, B.\bar{B}) = 0$

3. $0 \le P(A,B)$

β) Axiom of addition

$P(A, B \lor C) = P(A,B) + P(A,C) - P(A,B.C)$

γ) Axiom of multiplication

$P(A, B.C) = P(A,B) \cdot P(A.B,C)$

Axiom α,3 needs no qualification demanding that A be nonempty, because,
if A is empty, this inequality does not represent any restriction on the numerical
values of probabilities. According to the convention concerning the use
of the P-symbol for empty reference classes (see p. 59), the inequality α,3
expresses, in this case, merely a trivial existential statement. Thus the axiom 
does not depend on a special condition to be added in words. The only condi¬ 
tion of this kind is the axiom of univocality. It may be convenient to formu¬ 
late this axiom, together with the rule of existence and the convention about 
the use of P-symbol, as a rule given in the metalanguage. 
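
That the frequency interpretation satisfies the mathematical axiom system can be seen on a small model. The following Python sketch makes several assumptions that are not part of the text: classes are finite sets, P(A,B) is the relative frequency of B within a nonempty A, and negation is the complement within a hypothetical universe of twelve elements. The names `prob`, `neg`, and `satisfies_axioms` are illustrative.

```python
from fractions import Fraction
from itertools import product

UNIVERSE = frozenset(range(12))   # hypothetical finite domain

def prob(reference, attribute):
    """Relative frequency of `attribute` within a non-empty `reference` class."""
    return Fraction(len(reference & attribute), len(reference))

def neg(cls):
    """Negation of a class: its complement within the universe."""
    return UNIVERSE - cls

def satisfies_axioms(A, B, C):
    """Check the normalization, addition, and multiplication axioms for A, B, C."""
    checks = [
        prob(A, A | B) == 1,                                         # normalization 1
        prob(A, B & neg(B)) == 0,                                    # normalization 2
        prob(A, B) >= 0,                                             # normalization 3
        prob(A, B | C) == prob(A, B) + prob(A, C) - prob(A, B & C),  # addition
    ]
    if A & B:  # the multiplication check needs a non-empty reference class A.B here
        checks.append(prob(A, B & C) == prob(A, B) * prob(A & B, C))
    return all(checks)

classes = [frozenset(range(0, 8)), frozenset(range(4, 12)),
           frozenset(range(0, 12, 3)), frozenset({1, 5, 9})]

all_ok = all(satisfies_axioms(A, B, C) for A, B, C in product(classes, repeat=3))
# Rule of the complement, P(A,B) + P(A,not-B) = 1, as a derived consequence.
complement_ok = all(prob(A, B) + prob(A, neg(B)) == 1 for A in classes for B in classes)
print(all_ok, complement_ok)
```

The check is only a consistency test on one model; it does not replace the derivations, and it does not touch the convention for empty reference classes.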

It will be shown briefly how the rule of the complement (7, § 13) can be
derived from these axioms. We substitute first, in α,1, $B \lor \bar{B}$ for B; because,
according to (8c, § 4), the relation $(A \supset B \lor \bar{B})$ is always valid, we obtain,
by the use of (11),

$$P(A, B \lor \bar{B}) = 1$$

Dissolving the term on the left side according to axiom β, we obtain, by the
use of α,2,

$$P(A, B \lor \bar{B}) = P(A,B) + P(A,\bar{B}) = 1 \tag{12}$$

From this result we derive the special theorem of addition for the mutually
exclusive events B and C, that is, for $(B \supset \bar{C})$. Since $(B \supset \bar{C}) \equiv (\overline{B.C})$, the
relation $(A \supset \overline{B.C})$ follows from $(B \supset \bar{C})$ with the help of (8c, § 4); substituting
in (11) $\overline{B.C}$ for B and using axiom α,1 we thus derive

$$P(A, \overline{B.C}) = P(A, A \lor \overline{B.C}) = 1 \tag{13}$$

With the help of (12) we now derive $P(A,B.C) = 0$ and thus obtain from
axiom β the special theorem of addition.

Regarding the theorem of multiplication, the previous remarks hold good, 
according to which this theorem can be replaced by the weaker assumption 
of § 15. It is possible, furthermore, to replace the axioms β and γ by a compound
axiom, as was shown by William Gustin. According to Gustin, the
following mathematical axiom system is sufficient:

a) Normalization

1. $P(A.B, B) = 1$

2. $0 \le P(A,B)$

b) Axiom of the complement of the product

$P(A, \overline{B.C}) = 1 - P(A,B) \cdot P(A.B,C)$

The postulate of univocality must be added in words, as in the preceding 
system. The Gustin system shows that the axiom of addition can be replaced 
by the rule of the complement and that the latter can be combined with the 
axiom of multiplication in one axiom. In this system the rule of the comple¬ 
ment is derivable as follows: 

$$P(A,\bar{B}) = P(A,\overline{B.B}) = 1 - P(A,B) \cdot P(A.B,B) = 1 - P(A,B) \tag{14}$$

Using this result, we immediately derive from axiom b the general theorem
of multiplication. The general theorem of addition is proved as follows:

$$\begin{aligned}
P(A, B \lor C) &= P(A, \overline{\bar{B}.\bar{C}}) = 1 - P(A,\bar{B}) \cdot P(A.\bar{B},\bar{C}) \\
&= 1 - P(A,\bar{B}) \cdot [1 - P(A.\bar{B},C)] \\
&= P(A,B) + P(A,\bar{B}.C) \\
&= P(A,B) + P(A,C) \cdot P(A.C,\bar{B}) \\
&= P(A,B) + P(A,C) \cdot [1 - P(A.C,B)] \\
&= P(A,B) + P(A,C) - P(A,B.C)
\end{aligned} \tag{15}$$
Axiom α,2 follows when $\bar{B}$ is substituted for C in the theorem of addition,
and the rule of the complement is used. Axiom α,1 follows by substituting
A for C in the theorem of addition. The Gustin system is thus proved to
be equivalent to the system of axioms α-γ.
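
The successive lines of the derivation of the general theorem of addition from the Gustin system can be followed numerically. The sketch below, in Python, again assumes a finite frequency model (classes as sets, negation as complement in a hypothetical universe of thirty elements); the classes A, B, C are arbitrary illustrations. It evaluates Gustin's axiom b once and then each line of (15), all of which should yield the same value.

```python
from fractions import Fraction

UNIVERSE = frozenset(range(30))   # hypothetical finite domain

def prob(reference, attribute):
    """Relative frequency of `attribute` within a non-empty `reference` class."""
    return Fraction(len(reference & attribute), len(reference))

def neg(cls):
    """Negation of a class: its complement within the universe."""
    return UNIVERSE - cls

A = frozenset(range(0, 24))
B = frozenset(range(0, 30, 2))    # even numbers
C = frozenset(range(0, 30, 3))    # multiples of 3

# Axiom b: P(A, not-(B.C)) = 1 - P(A,B) * P(A.B,C)
axiom_b_holds = prob(A, neg(B & C)) == 1 - prob(A, B) * prob(A & B, C)

# The successive expressions in the derivation of the theorem of addition.
steps = [
    prob(A, B | C),
    1 - prob(A, neg(B)) * prob(A & neg(B), neg(C)),
    1 - prob(A, neg(B)) * (1 - prob(A & neg(B), C)),
    prob(A, B) + prob(A, neg(B) & C),
    prob(A, B) + prob(A, C) * prob(A & C, neg(B)),
    prob(A, B) + prob(A, C) * (1 - prob(A & C, B)),
    prob(A, B) + prob(A, C) - prob(A, B & C),
]
chain_holds = all(step == steps[0] for step in steps)
print(axiom_b_holds, chain_holds, steps[0])   # True True 2/3
```

Exact fractions are used so that equality between the lines is tested without rounding.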

The mathematical axiom systems presented here are sufficient to prove all 
the theorems of the calculus of probability. They do not carry through the 
formalization completely; the condition of univocality and the implication
between equations must be added verbally in the context. But the nonformal- 
ized residue is relatively small. It is true that my mathematical axiom systems 
require the use of symbolic logic for the inner part of the probability symbols, 
but I hope that the presentation shows that this feature only facilitates 
operations within the calculus. With the help of symbolic logic a probability 
calculus has been constructed that exhibits not only the mathematical but 
also the logical structure of its subject matter. I should be happy if the 
unification of mathematics and symbolic logic thus achieved would stimulate 
other authors to attempt similar constructions in other fields of research. 

Historical remark concerning the axiomatic construction of the calculus of probability. —Axio¬ 
matic foundations of the calculus of probabilities have been given repeatedly within the 
last few decades. Corresponding to my division into a formal and an interpreted theory of 
probability, two groups may be distinguished. The interpreted form of axiomatic construction 
regards probability, from the beginning, as a frequency, and derives from this interpreta¬ 
tion, by the possible inclusion of additional postulates, the rules of the theory. This group 
began with Richard von Mises’ analyses 1 (1919) and was continued by the inquiries of Karl
Dörge 2 (1930), Erhard Tornier 3 (1930), and Erich Kamke 4 (1932); it includes, also, the
investigations by Arthur H. Copeland 5 (1928).

The formal conception introduces the concept of probability by the method of implicit 
definitions, and uses no properties of the concept other than those expressed in a set of 
formal relations placed as axioms at the beginning of the theory, leaving open various 
possibilities for its interpretation. The group includes the axiom system given in 1901 by 
Georg Bohlmann 6 and the analyses published by S. Bernstein 7 (1917) and Émile Borel 8
(1925). To it belongs also my own axiomatic presentation, which was first published in 
1932. 9 It was followed by an axiomatic construction by A. N. Kolmogoroff 10 in 1933. 

1 “Grundlagen der Wahrscheinlichkeitsrechnung,” in Math. Zs., Vol. V (1919), pp. 52-99;
Vorlesungen aus dem Gebiete der angewandten Mathematik, Vol. I: Wahrscheinlichkeitsrechnung
. . . (Leipzig, 1931).

2 “Zu der von R. v. Mises gegebenen Begründung der Wahrscheinlichkeitsrechnung,” in
Math. Zs., Vol. 32 (1930), pp. 232-258.

3 “Eine neue Grundlegung der Wahrscheinlichkeitsrechnung,” in Zs. f. Physik, Vol. 63
(1930), p. 697; “Grundlagen der Wahrscheinlichkeitsrechnung,” in Acta Math., Vol. 60
(1933), pp. 239-380.

4 Einführung in die Wahrscheinlichkeitstheorie (Leipzig, 1932).

5 “Admissible Numbers in the Theory of Probability,” in Amer. Jour. Math., Vol. 50, No.
4 (1928), p. 535; and later papers.

6 Encykl. d. math. Wiss., Vol. 1, Part 2 D 4b (1901), pp. 852-917.

7 “Versuch einer axiomatischen Begründung der Wahrscheinlichkeitsrechnung,” in Mitt.
d. math. Ges. Charkow (1917), pp. 209-274.

8 Traité du calcul des probabilités (Paris, 1924-), Vol. I, Part 1; Principes et formules
classiques du calcul des probabilités (Paris, 1925).

9 “Axiomatik der Wahrscheinlichkeitsrechnung,” in Math. Zs., Vol. 34 (1932), pp. 568-619. 

10 Grundbegriffe der Wahrscheinlichkeitsrechnung (Berlin, 1933). 
Most inquiries of the formal group omit the development of a theory of the order of
probability sequences. The problem was first attacked by von Mises, Dörge, and Copeland
in articles applying the interpreted conception, whereas my presentation has shown that
the problem can be dealt with even within the formal conception. The next chapter deals
with the differences between my presentation and those of von Mises, Dörge, Tornier, and
other authors. These differences result from the fact that my theory develops a system 
comprising all types of order, whereas the other systems are restricted to special types. 

A third line of development, going much further back historically than the axiomatic 
inquiries, connects the treatment of probability with the methods of symbolic logic. This
line can be traced to Leibniz, 11 whose program of a mathematical logic included that of a 
logic of probability. The idea of construing probability as a relation between statements, 
which includes logical implication as a special case, was proposed in 1837 by Bernard Bol¬ 
zano. 12 British and American logicians have followed a similar course. In his fundamental 
work introducing the period of modern logic, George Boole 13 (1854) included a logic of 
probability; he was followed by John Venn 14 (1866), Charles S. Peirce 15 (1878), and John
M. Keynes 16 (1921). The latter work, besides combining symbolic logic with the calculus 
of probability, contains a report on earlier attempts at constructing such a calculus. In 
this group belong also the publications of Harold Jeffreys. 17 

My own presentation undertakes to unite the axiomatic method with the construction of 
a logico-mathematical calculus, which I developed without a knowledge of the calculi pub¬ 
lished much earlier by the authors cited. My theory of probability implication originated 
within the context of inquiries into the nature of causality. 18 The table of rules of probability 
implication given there is to be replaced by my present formulation. A summary of my 
theory of probability was published in French. 19 

11 See the presentation by Louis Couturat, La Logique de Leibniz . .. (Paris, 1901), pp. 
239-250. 

12 Wissenschaftslehre (1837; Leipzig, 1929-), § 161; see also the remark by Walter Dubislav,
in Erkenntnis, Vol. I (1930), p. 264.

13 The Laws of Thought (London, 1854).

14 The Logic of Chance (London and Cambridge, 1866).

15 The Doctrine of Chances (1878), printed in Collected Papers (Cambridge, Mass., 1932),
Vol. II, p. 395.

16 A Treatise on Probability (London, 1921).

17 Theory of Probability (Oxford, 1939).

18 “Die Kausalstruktur der Welt und der Unterschied von Vergangenheit und Zukunft,”
in Ber. d. bayer. Akad., Math.-phys. Kl. (1925), p. 144.

19 “Les Fondements logiques du calcul des probabilités,” in Ann. de l'Inst. Henri Poincaré,
Vol. VII, Part 5 (1937), pp. 267-348.
APPENDIX TO CHAPTER 3

Exercises

Problem 1 

According to official statistics from 1937, published by the National Socialist
government, Germany had 66,031,580 inhabitants (A); among them
were 502,799 Jews (J) and 325,541 sentenced criminals (C). Among the latter 
category were 1,794 Jews. What is the probability 

a) that a German Jew is a criminal? 

b ) that a non-Jewish German is a criminal? 

c) that a German criminal is a Jew? 

d) that a non-criminal German is a Jew? 

For the solution use the frequency interpretation directly. 


Problem 2 

Out of 10,000 unmarried men who are 20 years old (A), 28.3 die in that
year (D). Among these, 6.1 die from tuberculosis (T), and 6.6 die from accidents
(C) (German statistics of 1937). What is the probability

a) that a man 20 years old dies from tuberculosis or accident?

b) that a reported case of death of a man 20 years old is due to tuberculosis?

c) that a reported case of death of a man 20 years old is due to tuberculosis
or accident?

For the solution use the frequency interpretation directly. 

Problem 3

Throwing (A) with two dice distinguished as B and C, what is the prob¬ 
ability of getting a number smaller than 5 on die B or a number greater than 
4 on die C? 


Problem 4

Urn A contains 10 slips showing the number 1, 20 slips showing the number
2, 30 slips showing the number 3. Urn $B_1$ contains 30 black and 50 white
balls; urn $B_2$ contains 50 black and 50 white balls; urn $B_3$ contains 60 black
and 20 white balls. The drawing is made as follows. A slip is drawn from
urn A. The number obtained determines with which urn $B_i$ to continue, and
a ball is then drawn from that urn.

a) Determine the probability of getting a white ball (C).

b) If a white ball has been drawn, it being unknown from which of the
urns $B_i$ it was obtained, what is the probability that it was drawn,
respectively, from urns $B_1$, $B_2$, $B_3$?

Problem 5 

Mr. Smith's gardener is not dependable; the chances are 2 to 1 that he
will forget to water the rosebush during Smith's absence. The rosebush is in
a questionable condition; it has even chances of recovery if it is watered, but
only 25% chances of recovery if it is not watered. Upon returning, Smith 
finds that the rosebush has withered. What is the probability that the gardener 
did not water the rosebush? 

Problem 6 

Lady Catherine’s poodle has been missing for five days. There are only 
three explanations. Either the poodle went to the town, in which case there 
is a 3% chance that he was run over by a car, and a 50% chance that he was 
taken to the dog pound; or he went astray in the woods and was accidentally 
hit by a hunter, the chance of such an accident being 1%; or he went to the
village and was stolen by gypsies, who in the last year had stolen five out of 
ten stray dogs. On a dozen previous occasions the dog had been found four 
times in the town, twice in the village, and six times in the woods. An inquiry 
at the dog pound showed that he was not there. He had never been absent 
more than three days at a time. 

a) What is the probability that the dog was stolen by gypsies? 

b) If no inquiry at the dog pound had been made, what would be the 

probability that the dog was stolen by gypsies? 

Problem 7 

Under the present conditions (A) the chances for election as governor are
1/4 for Brown (B), 5/8 for Jones (J), 1/8 for Robinson (R). Should Brown be elected,
the chances for the construction of a certain highway (C) are 60%; the
chances are 80% if Jones is elected; and 20% if Robinson is elected.

a) On the basis of the present conditions, what is the chance that the
highway will be constructed?

b) On the evening of the election, before the counting of the votes, Jones
dies of apoplexy because of excitement. A simple majority will decide
the election. What is the chance now that the highway will be constructed?
Solutions

Problem 1

a) $P(A.J,C) = \dfrac{N^n(A.J.C)}{N^n(A.J)} = \dfrac{1{,}794}{502{,}799} = 3.57$ per thousand

b) $P(A.\bar{J},C) = \dfrac{N^n(A.\bar{J}.C)}{N^n(A.\bar{J})} = \dfrac{325{,}541 - 1{,}794}{66{,}031{,}580 - 502{,}799} = 4.94$ per thousand

c) $P(A.C,J) = \dfrac{N^n(A.C.J)}{N^n(A.C)} = \dfrac{1{,}794}{325{,}541} = 5.52$ per thousand

d) $P(A.\bar{C},J) = \dfrac{N^n(A.\bar{C}.J)}{N^n(A.\bar{C})} = \dfrac{502{,}799 - 1{,}794}{66{,}031{,}580 - 325{,}541} = 7.64$ per thousand

The figures show that criminality among Jews is smaller than among non-Jews.


Problem 2

a) $P(A, D.T \lor D.C) = \dfrac{N^n(A.D.[T \lor C])}{N^n(A)} = \dfrac{6.1 + 6.6}{10{,}000} = 0.00127$

c) $P(A.D, T \lor C) = \dfrac{N^n(A.D.[T \lor C])}{N^n(A.D)} = \dfrac{6.1 + 6.6}{28.3} = 0.45$

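Problem 3

Notation: $B_5$ = number smaller than 5 on die B; $C_4$ = number greater
than 4 on die C.

$$P(A, B_5 \lor C_4) = P(A,B_5) + P(A,C_4) - P(A,B_5.C_4) = \frac{2}{3} + \frac{1}{3} - \frac{2}{3} \cdot \frac{1}{3} = \frac{7}{9}$$


Problem 4

a) $$P(A,C) = \sum_{i=1}^{3} P(A,B_i) \cdot P(A.B_i,C) = \frac{1}{6} \cdot \frac{5}{8} + \frac{1}{3} \cdot \frac{1}{2} + \frac{1}{2} \cdot \frac{1}{4} = \frac{19}{48}$$

b) $$P(A.C,B_k) = \frac{P(A,B_k) \cdot P(A.B_k,C)}{\sum_{i=1}^{3} P(A,B_i) \cdot P(A.B_i,C)}$$

$$P(A.C,B_1) = \frac{5}{19} \qquad P(A.C,B_2) = \frac{8}{19} \qquad P(A.C,B_3) = \frac{6}{19}$$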
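
The solution of problem 4 can be retraced with exact fractions. The following Python sketch takes the counts of slips and balls from the problem statement; the variable names are illustrative. It applies the rule of elimination for question a and the rule of Bayes for question b.

```python
from fractions import Fraction

# Slips in urn A showing the numbers 1, 2, 3 decide which urn B_i is used.
slips = {1: 10, 2: 20, 3: 30}
prior = {i: Fraction(k, sum(slips.values())) for i, k in slips.items()}

# (black, white) balls in each urn B_i, as given in the problem statement.
urns = {1: (30, 50), 2: (50, 50), 3: (60, 20)}
white = {i: Fraction(w, b + w) for i, (b, w) in urns.items()}

# a) Rule of elimination: P(A,C) = sum over i of P(A,B_i) * P(A.B_i,C)
p_white = sum(prior[i] * white[i] for i in urns)

# b) Rule of Bayes: P(A.C,B_i) = P(A,B_i) * P(A.B_i,C) / P(A,C)
posterior = {i: prior[i] * white[i] / p_white for i in urns}

print(p_white)                                  # 19/48
print([str(posterior[i]) for i in (1, 2, 3)])   # ['5/19', '8/19', '6/19']
```

The three posterior values add up to 1, as they must, since the white ball came from exactly one of the three urns.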
Problem 5

This and problem 6 are examples of a formalization of an informal use of
probability rules, in particular of the rule of Bayes. The numerical values
used should be considered as rough estimates of the probabilities concerned.
The inference then leads to values that have some significance, at least
qualitatively, and correspond to instinctive appraisals of probabilities made,
for example, by detectives or other experts in indirect evidence, in situations
of the kind described.

Notation: A = the situation before Smith's voyage; W = the watering of
the rosebush; D = the withering of the rosebush.

$$P(A,W) = \tfrac{1}{3} \qquad P(A,\bar{W}) = \tfrac{2}{3} \qquad P(A.W,D) = \tfrac{1}{2} \qquad P(A.\bar{W},D) = \tfrac{3}{4}$$

$$P(A.D,\bar{W}) = \frac{P(A,\bar{W}) \cdot P(A.\bar{W},D)}{P(A,W) \cdot P(A.W,D) + P(A,\bar{W}) \cdot P(A.\bar{W},D)} = \frac{3}{4}$$


Problem 6 

Notation: A = general situation after the poodle's disappearance, but not
yet including a statement that an accident has occurred; T = the poodle's
going to the town; V = the poodle's going to the village; W = the poodle's
going to the woods; D = the poodle's being in the dog pound; C = the
poodle’s having an accident of any kind, including the case of his being stolen
by gypsies. The following values are given:

$$P(A,T) = \tfrac{1}{3} \qquad P(A,V) = \tfrac{1}{6} \qquad P(A,W) = \tfrac{1}{2}$$

$$P(A.T,C) = \tfrac{3}{100} \qquad P(A.T,D) = \tfrac{1}{2} \qquad P(A.V,C) = \tfrac{1}{2} \qquad P(A.W,C) = \tfrac{1}{100}$$

Question a: The poodle is not in the dog pound. Because he has never been
absent for more than three days but has now been missing for five days, we
consider the assumption of an accident as true. The assumption that he was
stolen by gypsies is equivalent to his having gone to the village and having
an accident, that is, to V.C. Therefore the probability sought for is given by

$$P(A.C,V) = \frac{P(A,V) \cdot P(A.V,C)}{P(A,C)} = 85\%$$

where

$$P(A,C) = P(A,T) \cdot P(A.T,C) + P(A,V) \cdot P(A.V,C) + P(A,W) \cdot P(A.W,C) = \frac{59}{600}$$
Question b: Here the situation is characterized by $C \lor D$, and the rule of
reduction must be applied:

$$P(A.[C \lor D],V) = \frac{P(A,C) \cdot P(A.C,V) + P(A,D) \cdot P(A.D,V)}{P(A,C) + P(A,D)}$$

Now $P(A.D,V) = 0$ because the dog pound is not in the village but in the
town. Furthermore, we have

$$P(A,D) = P(A,T) \cdot P(A.T,D) + P(A,\bar{T}) \cdot P(A.\bar{T},D)$$

Since the dog pound is in the town, $P(A.\bar{T},D) = 0$. Therefore

$$P(A,D) = P(A,T) \cdot P(A.T,D) = \tfrac{1}{3} \cdot \tfrac{1}{2} = \tfrac{1}{6}$$

and the probability asked for is given by

$$P(A.[C \lor D],V) = 31.5\%$$

This result shows that the probability of an accident is considerably smaller 
so long as there is a chance that the poodle is in the dog pound. 
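
Both questions of problem 6 can be recomputed with exact fractions. The Python sketch below takes its values from the problem statement; the variable names are illustrative. Question a uses the rule of Bayes, question b the rule of reduction for the mutually exclusive cases C and D.

```python
from fractions import Fraction

# Where the poodle went, estimated from the dozen previous occasions.
p_town, p_village, p_woods = Fraction(4, 12), Fraction(2, 12), Fraction(6, 12)

# Chance of an accident C in each place, and of the dog pound D when in town.
acc_town, acc_village, acc_woods = Fraction(3, 100), Fraction(5, 10), Fraction(1, 100)
pound_town = Fraction(1, 2)

# Rule of elimination for P(A,C) and P(A,D); the pound is reached only from town.
p_accident = p_town * acc_town + p_village * acc_village + p_woods * acc_woods
p_pound = p_town * pound_town

# Question a: rule of Bayes, P(A.C,V)
stolen_given_accident = p_village * acc_village / p_accident

# Question b: rule of reduction with P(A.D,V) = 0
stolen_given_accident_or_pound = (
    p_accident * stolen_given_accident / (p_accident + p_pound)
)

print(p_accident, stolen_given_accident, stolen_given_accident_or_pound)
# 59/600 50/59 50/159
```

The exact value for question a is 50/59, about 84.7%, which rounds to the 85% of the text. For question b the exact value is 50/159, about 31.4%; the figure 31.5% results when the rounded value 85% is carried into the rule of reduction.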


Problem 7 
Question a

$$P(A,C) = P(A,B) \cdot P(A.B,C) + P(A,J) \cdot P(A.J,C) + P(A,R) \cdot P(A.R,C) = \frac{27}{40} = 67.5\%$$

Question b

$$P(A.[B \lor R],C) = \frac{P(A,B) \cdot P(A.B,C) + P(A,R) \cdot P(A.R,C)}{P(A,B) + P(A,R)} = \frac{7}{15} = 46.6\%$$
THEORY OF THE ORDER OF
PROBABILITY SEQUENCES 

§ 26. The Task of the Theory of Order 

The first part of the probability calculus, which thus far has been presented, 
is the elementary calculus of probability. It deals with the logical elements of 
the probability calculus, just as the elements of geometry comprise the logical 
foundations of geometry. In the elementary calculus, probability sequences 
are treated with respect to their external connections. Sequences to which
degrees of probability are attached play in it the part of units the mutual 
relations of which are investigated, without regard to the internal structure 
of the units. The second part of the probability calculus, which we now 
begin, is concerned with the internal order of probability sequences. The ele¬ 
mentary calculus is concerned with probability sequences only with respect 
to the one property of possessing a probability, or, in the frequency inter¬ 
pretation, of having a limit of the frequency; but no statements are made 
about the manner in which the elements of such sequences follow one another. 
As a consequence of this procedure, sequences of very different types of order 
are subsumed under the name of probability sequence. Thus we regard a 
strictly alternating sequence 

$$B \bar{B} B \bar{B} B \bar{B} B \bar{B} B \ldots \tag{1}$$

as a probability sequence because the frequency of B approaches a limit,
which in this case is = 1/2. But statistical sequences of natural events show a
different structure. For instance, they may have the form of the sequence 

BBBBBbBBBBBSBBBBbBBBB... ( 2 ) 

obtained by tossing a coin. Both the sequences (1) and (2) go toward the limit
1/2 of the frequency; however, the sequence (1) shows a strict order, whereas
the sequence (2) is irregular, that is, is a random sequence. Since randomness 
is an essential feature of a number of very important problems of probability, 
the structure of a sequence like (2) must now be described by logical means. 
Of this kind are the problems investigated by the theory of the order of prob¬ 
ability sequences. We shall deal not only with types of extreme order as given 
by (1) and (2); we shall become acquainted also with a number of intermediate
types which are not completely irregular like (2), but which do not possess
the strict order of (1). Sequences of type (2) will also be called normal sequences;
the other types are called, correspondingly, nonnormal sequences.
Further distinctions among the latter will be specified subsequently. 
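
The contrast between the two extreme types can be made concrete. The following Python sketch compares a strictly alternating sequence with a simulated coin-tossing sequence; the simulation by a pseudorandom generator with a fixed seed, the length of 10,000 elements, and the function names are assumptions made for illustration. Both sequences have a relative frequency of B near 1/2, but they differ in their internal order, shown here by the longest run of equal elements.

```python
import random

def relative_frequency(seq):
    """Frequency of B (encoded as True) among the elements of a finite sequence."""
    return sum(seq) / len(seq)

def longest_run(seq):
    """Length of the longest block of equal consecutive elements."""
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

n = 10_000
alternating = [i % 2 == 0 for i in range(n)]      # B, not-B, B, not-B, ...
rng = random.Random(0)                            # fixed seed for repeatability
coin = [rng.random() < 0.5 for _ in range(n)]     # simulated coin tosses

print(relative_frequency(alternating), relative_frequency(coin))
print(longest_run(alternating), longest_run(coin))
```

The alternating sequence has frequency exactly 0.5 and no run longer than one element; the simulated sequence has a frequency close to 0.5 and contains considerably longer runs. A finite initial section can of course only illustrate, not establish, a statement about the limit.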


To carry out this task we require, apart from the concepts already developed,
some conceptual means with the help of which the internal structure
of probability sequences can be characterized, and, furthermore, another 
group of axioms that is not used in the elementary probability calculus. Like 
the other axiom groups, the new group can be shown to follow logically from 
the frequency interpretation and thus to require of probability sequences only 
such characteristics as are expressed implicitly by the frequency interpre¬ 
tation. 

An essential feature of my theory of order is that it deals with all possible 
forms of probability sequences and is not restricted to sequences of one type 
of order, such as normal sequences. In this respect my probability theory 
differs from others, in particular from that developed by R. von Mises.
Such theories regard randomness as an essential characteristic of the very 
concept of probability; and they contend that the meaning of probability 
cannot be exhaustively formulated without reference to randomness. But we 
shall find that such a restriction of the concept of probability is not expedient; 
it artificially creates sharp boundaries in a region in which more or less general 
types occur side by side in a natural order of types. Those who regard ran¬ 
domness as a necessary feature of all probability sequences forget that a 
sequence like (1) represents merely the limiting case of a partially ordered 
type of sequence that still shows, in a less rigid form, rather important char¬ 
acteristics of “disorder” (see § 33). It is possible, in the theory of order, to 
represent by the same conceptual means all the different types of probability 
sequences lying between the extreme cases (1) and (2). Therefore I believe 
my general probability calculus to be preferable to any special calculus that 
is restricted to a certain type of order. Linguistic usage, too, has decided in 
favor of a more general concept of probability. In physics an important role 
is played by sequences with a probability aftereffect, which represent a type 
intermediate between (1) and (2). 

For these reasons we develop the theory of order by beginning with the 
characterization of normal sequences and proceeding thence, step by step, 
to more general types of sequences. The characterization of a particular 
sequence type is to be conceived as a matter of definition; we cannot ask, 
therefore, whether the “correct” characterization has been found. On the 
other hand, definitions will be constructed in such a manner that they pro¬ 
vide types relevant to practical applications. 

§ 27. Phase Probabilities 

To define the different types of order, some means of structural characterization 
will be constructed by the use of a specific method: we derive certain subsequences
from the sequence considered, the major sequence, and then investigate
the probabilities of these subsequences. The method of representing the
structure of order by statements about the probabilities in subsequences was 
first carried through by von Mises for the purpose of defining random se¬ 
quences. Although I shall construct a somewhat different definition of such 
sequences, I shall follow the same method, in principle, for it turns out to be 
an excellent instrument of structural characterization. I shall, however, extend 
the use of the method to the construction of further types of sequences, fol¬ 
lowing my plan of including, in the theory of probability sequences, the whole 
variety of types of structural order. 

The first of the means of structural characterization is provided by phase 
probabilities. This concept will be developed with the help of the frequency 
interpretation; but it will be shown later that the method applies also in a 
purely formal conception of probability. 

To begin with some practical examples, compare (2, § 26) with (1, § 26).
An essential difference between the two sequences is expressed by the relation
of each individual element to its predecessors. In (1, § 26) B is always
followed by S, and S by B; in (2, § 26), however, there is no such dependence
on the predecessor: sometimes B is followed by B and sometimes by S. The
type of order can therefore be characterized, at least in one essential feature, 
by stating the probability of an element with respect to its predecessors. 
The normal sequence will then have the property that the predecessors of 
an element have no influence on whether the element is of the kind B or S ; 
in the normal sequence the probability of the occurrence of a specific element 
is independent of its predecessors. 

The idea may be illustrated by an example from a more extended sequence. 
We reproduce in (1) a sequence obtained by throwing one die, writing B for 
an even number and S for an odd number. The sequence is written in such 
a way that the right end of a row is to be continued by the left end of the next 
row, like the lines of a printed page. 

BBBBBBBBBBSSBBSSSSSBB BB BBBS 

BSSSBBSBSBBBSSBBSSBSBSBBSSB (1) 

• » • • • • • • • • • ♦ • 

B BBBBBBBBBBBBBBS BBSSSBBBSS 

Among the 80 elements of this sequence are 41 B's and 39 S's, so that the probability
1/2 is reasonably well satisfied. The 41 elements that are preceded by a
B are marked underneath with a dot. They form a subsequence within the
major sequence; and the probability, or the frequency, of B within this subsequence
is the means of structural characterization to be used. If we count
out the B-cases among the elements with the dot underneath, we obtain 20
as result, that is, the probability of a B-element in the subsequence is 20/41,
which again is approximately = 1/2. In the same manner the probability of an
S-element in the subsequence is determined: it is = 21/41 ≈ 1/2. The probability
that S follows S is calculated by counting S in the subsequence of elements
without a dot (dropping the first element, which has no predecessor); it is
= 18/38. Similarly, the probability of B following S is found to be 20/38. These
two values also are nearly equal to 1/2. The figures express the fact that within
the sequence under consideration the predecessor does not influence the occurrence
of a B-element. In the sequence (1, § 26) we would obtain, within a
similarly selected subsequence, a probability of S different from that in the
major sequence; here the probability of S would be equal to 0 in the subsequence
selected by S as predecessor, because S is always followed by B.
This fact shows the nonnormal character of the sequence (1, § 26).

Correspondingly, the influence of a combination of predecessors can be 
enumerated. For instance, we can consider in (1) the subsequence of elements 
having two predecessors 5 5; we thus obtain the value ^ for the probability 
that a combination 5 5 is followed by a 5. Similarly, the probability that a 
combination B 5 is followed by a 5 has the value tt. That is nearly \ for 
both again. The latter phase probabilities are independent of the former, 
which refer to only one predecessor, because the last-mentioned figures cannot 
be derived from the values given first, but must be ascertained by a separate 
counting. It is possible that the first predecessor does not produce a selection 
deviating from the major sequence, whereas the second predecessor does so. 
The normal sequence, however, as shown in the example, is characterized by 
particularly simple relations; it satisfies the condition that a selection by 
means of any chosen combination of predecessors results in the same probability
of B as exists in the major sequence.
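The counting procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `phase_frequency` and the short test strings are hypothetical and are not the sequence (1) of the text; a sequence is represented as a string of the letters B and S.

```python
from fractions import Fraction

def phase_frequency(seq, predecessors, target):
    """Relative frequency of `target` among the elements of `seq` that are
    immediately preceded by the block `predecessors` (a string such as "B"
    or "SS"). Returns None when no element has these predecessors."""
    k = len(predecessors)
    # Collect every element whose k predecessors match the given block.
    selected = [seq[i] for i in range(k, len(seq)) if seq[i - k:i] == predecessors]
    if not selected:
        return None  # empty subsequence: no frequency is defined
    return Fraction(selected.count(target), len(selected))

# A strictly alternating sequence, in the manner of (1, § 26).
alternating = "BS" * 10
print(phase_frequency(alternating, "S", "S"))  # 0
print(phase_frequency(alternating, "B", "S"))  # 1
```

For the alternating sequence the frequency of S after S is 0 although the frequency of S in the whole sequence is 1/2, which is the deviation that marks the sequence as nonnormal. Passing a longer block such as "SS" gives the selection by a combination of predecessors.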

The concept of probability was introduced for a sequence pair; a sequence 
such as (1) must, therefore, be coordinated to another sequence, so that a 
normal sequence pair results. It is possible to suppress the first sequence in (1), 
because the corresponding sequence consists exclusively of elements $x_i \in A$,
that is, it is compact, using a term defined in § 16. If this is not the case, and
the sequence is interspersed, we shall cross out the elements of the sequence
$y_i$ for which $x_i \in \bar{A}$ holds, and consider the structural order of the reduced
sequence only, to which we assign subscripts by the rule

$$ i' = N_{k=1}^{i}\,(x_k \in A) \tag{2} $$

This relation coordinates an i ' to each value i of the elements belonging 
to A. With the help of this reduction, the first sequence of a normal sequence 
pair assumes a very simple structure and can be identified with the sequence 
of the subscripts, so that the pair is replaced by one sequence. In this sense,
reference may be made to a normal sequence without reference to a pair. The
same simplification will be used in studying other types of sequences.
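The reduction of an interspersed sequence by rule (2) can be illustrated as follows. This is a minimal sketch: the function name `reduce_to_compact` and the sample data are assumptions made for the illustration, and membership of $x_i$ in A is represented by a boolean flag.

```python
def reduce_to_compact(xs_in_A, ys):
    """Cross out every y_i whose x_i is not in A and renumber the rest.

    xs_in_A: list of booleans, True where x_i belongs to A
    ys:      list of attributes y_i of the same length
    Returns pairs (i', y_i), where i' counts the x_k in A with k <= i,
    as in rule (2).
    """
    reduced = []
    count = 0
    for in_A, y in zip(xs_in_A, ys):
        if in_A:
            count += 1               # i' = number of A-elements up to place i
            reduced.append((count, y))
    return reduced

pairs = reduce_to_compact([True, False, True, True, False],
                          ["B", "S", "S", "B", "B"])
print(pairs)  # [(1, 'B'), (2, 'S'), (3, 'B')]
```

After the reduction the new subscripts run through 1, 2, 3, ... without gaps, so the first sequence of the pair can be identified with the subscripts themselves.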

A detailed notation for phase probabilities will now be defined, to be fol¬ 
lowed by a corresponding extension of the abbreviated way of writing. The 
probability of a B following a B will be written in the detailed notation as 

$$ (i)\,(x_{i+1} \in A \,.\, y_i \in B \;\underset{p}{\ni}\; y_{i+1} \in B) \tag{3} $$

Here a phase occurs in the subscript. The convention is now introduced that
this phase, in the abbreviated notation, is added as a superscript to the letter
specifying the class, so that (3) is abbreviated to the form

$$ (A^1 \,.\, B \;\underset{p}{\ni}\; B^1) \tag{3'} $$

In the P-notation we write

$$ P(A^1 \,.\, B,\, B^1) = p \tag{4} $$

A further restriction for the sequences, by which certain mathematically 
irrelevant cases are eliminated, is expedient. Apart from the premise that 
the A-sequence is compact, it will be assumed that the class B, or the com¬ 
binations of the classes B . C . . . G, always occur in infinitely many ele¬ 
ments. In other words, it will be assumed that the probabilities occurring 
represent genuine limits in the frequency interpretation. It then follows from 
this interpretation that we can drop the phase in the term A of (4), since the
element $x_i$ as well as the element $x_{i+1}$ belongs to A (see also § 28). We thus
write for (4)

$$ P(A \,.\, B,\, B^1) = p \tag{5} $$

Such quantities as (5) in which superscripts occur are called phase probabilities.
It must not be assumed, however, that the existence of the major
probability of a sequence insures the existence of phase probabilities; it repre¬ 
sents a certain specialization of the sequence type if we demand that all the 
phase probabilities of the sequence exist. The type of sequence is much more 
specialized if we demand that certain phase probabilities are equal. By means 
of such relations the characterization of the sequence type asked for will be 
given. For instance, the relation 

$$ P(A \,.\, B,\, B^1) = P(A,\, B) \tag{6} $$

expresses the property, illustrated by example (1), that the immediate prede¬ 
cessor has no influence on the probability. 

Before going on with these considerations, I wish to comment on the principles
underlying the construction of the theory of order.

§ 28. Axioms Concerning the Theory of Order

In the preceding section, the probabilities of subsequences occurring in the 
characterization of normal sequences were interpreted as frequencies. It must 
now be emphasized that this is not necessary; like any other probabilities, 
phase probabilities may be regarded from the point of view of the formal 
probability calculus, that is, may be conceived as undefined quantities of 
which it is known only that they satisfy the axioms of the probability calculus. 
Therefore it is possible to characterize the type of order of a sequence axiomatically,
that is, to give a definition of the type of sequence even in the formal
calculus.

To achieve this result it is essential to make a distinction between order 
and frequency of sequences. The requirement that the probability statement 
always concern sequences was incorporated in the logical foundation of the 
probability calculus; therefore, statements can be made about the order of 
the sequences within the formal system. These statements assert the existence 
of certain probabilities for subsequences. That the probabilities of subse¬ 
quences represent limits of the frequency is an additional assertion, which 
need not be stated within the axiomatic formulation of the probability calculus;
the assertion is dispensable here, as it is for probabilities in general. We
can therefore regard probability statements that contain terms with super¬ 
scripts as uninterpreted statements, to which, however, the same kind of 
interpretation is to be given as is used for all other probability statements. 
For this purpose the assumption will be made that the letters having a phase 
sign designate objects for which the probability axioms are valid; so it is 
permissible to apply the usual operations and substitutions to letters carrying 
a phase sign. This assumption is not an addition to the axiom system, but it 
does extend the field of application of the axioms.

Some additions to the axiom system will be made, however. In the pre¬ 
ceding section we performed an operation with the phase superscript, omitting 
this phase symbol on the letter A. That we are entitled to do so cannot be 
derived from the axioms given in the preceding chapter because those axioms 
do not refer to superscripts. Only from the frequency interpretation can the 
admissibility of this operation be derived. In order to be independent of this 
interpretation we shall introduce two new axioms that allow us to perform 
operations with superscripts in the formal system. These axioms, which refer 
only to infinite sequences, will be called axioms of order. Like the other axioms, 
they are derivable from the frequency interpretation. 

To formulate these axioms we go back to the logical notation of the prob¬ 
ability calculus. We must first formulate a premise of the axioms stating that 
we deal with sequences consisting of an infinite number of elements and referred
to a compact sequence A. The latter property can be expressed immediately by

$$ (A) \tag{1} $$

because, according to the translation rule, this expression, in the detailed
notation, means

$$ (i)\,(x_i \in A) \tag{1'} $$

The formulation includes the condition that the number of elements of the
A-sequence is infinite, for we regard the range of the subscript i as given by
the positive integers; if the number of A-elements were finite, the subsequent
elements $x_i$ would be $\bar{A}$-elements. To formulate the infinity of the B-sequence,
the abbreviation $(\exists^\infty B)$ is introduced by the following definition:

$$ (\exists^\infty B) =_{Df} \overline{(\exists m)(n)\,[\,N_{i=1}^{n}\,(y_i \in B) < m\,]} \tag{2} $$

The right side of the definition states verbally: there exists no number m
such that, for increasing subscript n, the number of the elements B to the
n-th place remains below m. In an analogous way the condition can be expressed
for combinations:

$$ (\exists^\infty B^{\alpha} \ldots G^{\tau}) =_{Df} \overline{(\exists m)(n)\,[\,N_{i=1}^{n}\,(y_{i+\alpha} \in B) \ldots (g_{i+\tau} \in G) < m\,]} \tag{3} $$

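With the help of this notation, the new group of axioms is written as follows:

V. Axioms of the theory of order

1. $(A)\,.\,(\exists^\infty C) \supset [(A^{\alpha}\,.\,C \;\underset{p}{\ni}\; B^{\beta}) \equiv (A\,.\,C \;\underset{p}{\ni}\; B^{\beta})]$

2. $(A)\,.\,(\exists^\infty C^{\alpha} \ldots G^{\nu}) \supset [(A\,.\,C^{\alpha} \ldots G^{\nu} \;\underset{p}{\ni}\; B^{\tau}) \equiv (A\,.\,C^{\alpha-\rho} \ldots G^{\nu-\rho} \;\underset{p}{\ni}\; B^{\tau-\rho})]$

These axioms, in the given order, have the same meaning as the following
formulas in the P-notation:

$$ P(A^{\alpha}\,.\,C,\, B^{\beta}) = P(A\,.\,C,\, B^{\beta}) \tag{4} $$

$$ P(A\,.\,C^{\alpha} \ldots G^{\nu},\, B^{\tau}) = P(A\,.\,C^{\alpha-\rho} \ldots G^{\nu-\rho},\, B^{\tau-\rho}) \tag{5} $$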

Whereas, however, the implicative formulation of the axioms states the premises
for which the equalities hold, these conditions are omitted in the P-notation;
a corresponding verbal qualification, therefore, must be understood for
formulas (4)-(5). Apart from this restriction, it is possible to manipulate
these formulas in the same way as all the other formulas of the system. For
instance, in (4) we can substitute $A^{\gamma}$ for C and thus derive, by applying (4)
twice,

$$ P(A^{\alpha}\,.\,A^{\gamma},\, B^{\beta}) = P(A\,.\,A,\, B^{\beta}) = P(A,\, B^{\beta}) \tag{6} $$

Furthermore, in (5) it is possible to replace the terms $C^{\alpha} \ldots G^{\nu}$ by the
factor $C \lor \bar{C}$, which is always true and may be dropped. Thus we have,
with (4),

$$ P(A,\, B^{\alpha}) = P(A^{\alpha},\, B^{0}) = P(A,\, B) \tag{7} $$

when we choose the quantity $\rho$ occurring in (5) equal to $\alpha$.

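In § 18, axioms i-iv were derived from the frequency interpretation. We
shall construct a similar derivation for the v-th group of axioms. The derivation
is greatly simplified by the use of F-symbols. The proof is given in detail,
for we shall use this calculation again, later. First, we write for (4)

$$ F^n(A^{\alpha}\,.\,C,\, B^{\beta}) = \frac{N_{i=1}^{n}\,(x_{i+\alpha} \in A)\,.\,(z_i \in C)\,.\,(y_{i+\beta} \in B)}{N_{i=1}^{n}\,(x_{i+\alpha} \in A)\,.\,(z_i \in C)} = \frac{a_n - \bar{\alpha}}{b_n - \bar{\alpha}} \tag{8} $$

$$ a_n = N_{i=1}^{n}\,(x_i \in A)\,.\,(z_i \in C)\,.\,(y_{i+\beta} \in B) \qquad b_n = N_{i=1}^{n}\,(x_i \in A)\,.\,(z_i \in C) $$

$$ \bar{\alpha} = 0 \text{ for } \alpha \geq 0 \qquad \bar{\alpha} = \alpha \text{ for } \alpha < 0 $$

We now use the identity

$$ \frac{a_n + \delta_n}{b_n + \eta_n} = \left(\frac{a_n}{b_n} + \frac{\delta_n}{b_n}\right) \cdot \frac{1}{1 + \dfrac{\eta_n}{b_n}} \tag{9} $$

If $a_n$ and $b_n$ (or only $b_n$) increase continually, while $|\delta_n|$ and $|\eta_n|$ remain below a
finite limit for all n, it follows that

$$ \lim_{n \to \infty} \frac{a_n + \delta_n}{b_n + \eta_n} = \lim_{n \to \infty} \frac{a_n}{b_n} \tag{10} $$

We assume that at least one of the two limits exists. Because $|\delta_n| = |\eta_n| = |\bar{\alpha}|$
is subject to the upper bound $|\alpha|$, or even equal to 0, and, also,

$$ F^n(A\,.\,C,\, B^{\beta}) = \frac{a_n}{b_n} \tag{11} $$

it follows that (4) is valid.

Similarly, we write for (5)

$$ F^n(A\,.\,C^{\alpha} \ldots G^{\nu},\, B^{\tau}) = \frac{N_{i=1}^{n}\,(x_i \in A)\,.\,(z_{i+\alpha} \in C) \ldots (g_{i+\nu} \in G)\,.\,(y_{i+\tau} \in B)}{N_{i=1}^{n}\,(x_i \in A)\,.\,(z_{i+\alpha} \in C) \ldots (g_{i+\nu} \in G)} = \frac{a_n + \delta_n}{b_n + \eta_n} \tag{12} $$

$$ a_n = N_{i=1}^{n}\,(x_i \in A)\,.\,(z_{i+\alpha-\rho} \in C) \ldots (g_{i+\nu-\rho} \in G)\,.\,(y_{i+\tau-\rho} \in B) $$

$$ b_n = N_{i=1}^{n}\,(x_i \in A)\,.\,(z_{i+\alpha-\rho} \in C) \ldots (g_{i+\nu-\rho} \in G) $$

$$ |\delta_n| \leq |\rho| \qquad |\eta_n| \leq |\rho| $$

Because of

$$ F^n(A\,.\,C^{\alpha-\rho} \ldots G^{\nu-\rho},\, B^{\tau-\rho}) = \frac{a_n}{b_n} \tag{13} $$

and by the use of (10), the derivation of equation (5) is thus achieved.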

From this proof it follows that the axioms v, like the previous axioms, are 
valid for all probability sequences, for they could be derived from the fre¬ 
quency interpretation. They are therefore not restricted to normal sequences 
and will be used for calculations with both normal and nonnormal sequences. 

Reference must be made to the application of the existence rule in com¬ 
bination with the two axioms v. That the rule of existence is valid for the 
frequency interpretation was inferred from the fact that probability equa¬ 
tions correspond to tautological frequency equations that are valid even 
before the transition to the limit (§ 18). It could thus be derived that in the
transition $n \to \infty$ the m-th probability quantity must approach a limit, if
limits exist for the m − 1 other probabilities occurring. The situation is
somewhat different for the axioms v, because they are not valid before the 
transition to the limit, but become tautologies only for the limits themselves. 
It follows that a probability equation derived by the use of the axioms v is 
not strictly valid for the coordinated frequencies; and at first it seems doubtful 
whether we can infer, for the transition to the limit, the existence of the limit 
of the m-th probability from the existence of the limits of the m — 1 other 
probability quantities. But it will now be shown that in this case, too, the 
existence of the m-th probability is determined; in other words, equations 
derived by the help of the axioms v also determine existence.

The problem under investigation is made clear by the following considerations.
We take the probability equation before applying the axioms v; then
there corresponds to it a tautologous frequency equation (symbols as in § 18),

$$ r(f_1^n \ldots f_m^n,\, f_{m+1}^n \ldots f_{m+s}^n) = 0 \tag{14} $$

Here $f_{m+1}^n \ldots f_{m+s}^n$ are the frequency quantities that, provided they exist,
will become equal to one another in the limit $n \to \infty$, according to one of the
axioms v; $f_1^n \ldots f_m^n$ are other frequency quantities. From (14) we can go
to the limit equation, constructed without the use of the axioms v:

$$ r(p_1 \ldots p_m,\, p_{m+1} \ldots p_{m+s}) = 0 \tag{15} $$

This equation determines existence just as do other probability equations:
from the existence of the limit for m + s − 1 quantities of the probabilities
occurring, the existence of the limit for the (m + s)-th probability can be
inferred. If the axioms v are now applied to (15), the equation

$$ r(p_1 \ldots p_m,\, p_{m+1}) = 0 \tag{16} $$

obtains. We do not yet know whether this equation determines existence,
because no strict frequency equation

$$ r(f_1^n \ldots f_m^n,\, f_{m+1}^n) = 0 \tag{17} $$

corresponds to it. If, in (15), the existence of the limits is not known for all
the $p_{m+1} \ldots p_{m+s}$, we cannot infer from (15) that the limits exist. However,
assuming that (16) determines existence, we can conclude from (16) that
$p_{m+1}$ exists. Such an inference, therefore, must be justified by a separate proof.

For this purpose we derive a property of the frequency equations corresponding
to the axioms v. The transition from (9) to (10), with the help of
which we inferred the validity of v,1, was based on the assumption that at
least one of the two limits exists. But we are able to obtain a relation free
from this assumption when we form the difference of the two frequency
quantities

$$ f_1^n = \frac{a_n}{b_n} \qquad f_2^n = \frac{a_n + \delta_n}{b_n + \eta_n} $$

We then obtain

$$ f_2^n - f_1^n = \frac{\delta_n - \eta_n \dfrac{a_n}{b_n}}{b_n + \eta_n} $$

It follows that if $b_n$ increases toward infinite values with n, whereas $|\delta_n|$ and
$|\eta_n|$ remain for all n below a definite bound, then

$$ \lim_{n \to \infty} (f_2^n - f_1^n) = 0 \tag{18} $$

Because of $\frac{a_n}{b_n} \leq 1$, it is irrelevant whether $a_n$ also increases toward infinite
values or whether it remains finite. The relation (18) is valid even if the
separate limits for $f_1^n$ and $f_2^n$ do not exist. In an analogous way, the inference
for the axiom v,2 is carried out by interpreting the quantities $a_n$ and $b_n$
as in (12).
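The behavior expressed by (18) can be checked numerically. The following sketch uses exact rational arithmetic; the function name `frequency_gap` and the particular counts are hypothetical choices made only for the illustration. It verifies the closed form of the difference of the two frequency quantities and shows the difference shrinking as $b_n$ grows while $\delta_n$ and $\eta_n$ stay bounded.

```python
from fractions import Fraction

def frequency_gap(a_n, b_n, delta_n, eta_n):
    """Return f2 - f1 for f1 = a_n/b_n and f2 = (a_n+delta_n)/(b_n+eta_n),
    checking it against the closed form (delta_n - eta_n*a_n/b_n)/(b_n+eta_n)."""
    f1 = Fraction(a_n, b_n)
    f2 = Fraction(a_n + delta_n, b_n + eta_n)
    closed_form = (delta_n - eta_n * f1) / (b_n + eta_n)
    assert f2 - f1 == closed_form  # exact identity, no rounding involved
    return f2 - f1

# Hypothetical counts with a_n/b_n = 1/2 and fixed bounded corrections.
for b_n in (10, 100, 1000, 10000):
    a_n = b_n // 2
    print(b_n, float(frequency_gap(a_n, b_n, delta_n=3, eta_n=3)))
```

Nothing in the check requires that $a_n/b_n$ itself converge; only the boundedness of the corrections and the growth of $b_n$ are used.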

We make use of this fact for a transformation of (14). We assume that the
function r occurring in (14) is continuous with respect to all its arguments.
Subtracting, now, from the arguments $f_{m+2}^n \ldots f_{m+s}^n$ the argument $f_{m+1}^n$, we
go from r to a function r':

$$ r'(f_1^n \ldots f_m^n,\, f_{m+1}^n,\, g_{m+2}^n \ldots g_{m+s}^n) = 0 \tag{19} $$

$$ g_{m+i}^n = f_{m+i}^n - f_{m+1}^n \tag{20} $$

which also is continuous for all its arguments. If we make the transition to
the limit $n \to \infty$, the $g_{m+i}^n$ go to the limit 0 because of (18); therefore (19)
must be an equation that determines existence for the variables $f_1^n \ldots f_{m+1}^n$.
It follows that, if limits exist for the $f_1^n \ldots f_m^n$, a limit must also exist for
$f_{m+1}^n$; thus, because of (18), limits must exist for the values $f_{m+2}^n \ldots f_{m+s}^n$.

It is thereby demonstrated that it is compatible with the frequency inter¬ 
pretation to apply the rule of existence to probability equations constructed 
with the help of the axioms v. We shall therefore extend the existence rule 
to the wider probability calculus obtaining when the axioms v are included. 
In other words, all probability equations will be regarded as determining 
existence. 

§ 29. Sequences without Aftereffect 

Phase probabilities are closely connected with the probabilities of certain 
combinations of consecutive elements. Applying the general theorem of multiplication,
we represent the probability of a combination B B by

$$ P(A,\, B\,.\,B^1) = P(A,\, B) \cdot P(A\,.\,B,\, B^1) \tag{1} $$

Correspondingly, the probability of a combination B B B is written in the
form

$$ \begin{aligned} P(A,\, B\,.\,B^1\,.\,B^2) &= P(A,\, B\,.\,B^1) \cdot P(A\,.\,B\,.\,B^1,\, B^2) \\ &= P(A,\, B) \cdot P(A\,.\,B,\, B^1) \cdot P(A\,.\,B\,.\,B^1,\, B^2) \end{aligned} \tag{2} $$

Still higher phase probabilities occur for longer combinations. Only in those 
sequences for which all the phase probabilities exist do probabilities exist for 
all combinations of consecutive elements. 

A particular type of sequence is characterized by the condition that all 
phase probabilities of the sequence be equal to the major probability. Such
a sequence will be called free from aftereffect. This concept does not completely
define the normal sequence, but establishes a more general type; however, 
we must study this type first because it is a necessary part of the definition 
of the normal sequence. 

For the symbolic formulation it is useful to introduce, instead of the disjunction
$B \lor \bar{B}$, the many-term disjunction $B_1 \lor \ldots \lor B_r$, which, however,
must be complete and exclusive. Then we have as definition of the sequence
free from aftereffect the relation

$$ P(A\,.\,B^{1}_{i_1} \ldots B^{\nu-1}_{i_{\nu-1}},\, B^{\nu}_{i_\nu}) = P(A,\, B_{i_\nu}) \tag{3} $$

The use of sub-subscripts is a convenient notational device; every total subscript
$i_\rho$ is a free variable for which any value may be chosen, whereas the
sub-subscript $\rho$ indicates the phase of the term to which the chosen value
belongs. Thus $i_\rho$ means: the subscript belonging to the term occurring with
the phase $\rho$. That (3) is meant to hold for all subscripts and all phase lengths
$\nu$ is expressed by the fact that every $i_\rho$ as well as $\nu$ is a free variable.

The probability of combinations according to (1) obtains a very simple
form by the use of (3); it is determined by the special theorem of multiplication.
This is proved by using the general theorem of multiplication in combination
with (3):

$$ \begin{aligned} P(A,\, B^{1}_{i_1} \ldots B^{\nu}_{i_\nu}) &= P(A,\, B^{1}_{i_1} \ldots B^{\nu-1}_{i_{\nu-1}}) \cdot P(A\,.\,B^{1}_{i_1} \ldots B^{\nu-1}_{i_{\nu-1}},\, B^{\nu}_{i_\nu}) \\ &= P(A,\, B^{1}_{i_1} \ldots B^{\nu-1}_{i_{\nu-1}}) \cdot P(A,\, B_{i_\nu}) \end{aligned} \tag{4} $$

Applying the same procedure to the first term on the right side, we arrive,
after $\nu - 1$ steps, at the result

$$ P(A,\, B^{1}_{i_1} \ldots B^{\nu}_{i_\nu}) = P(A,\, B_{i_1}) \ldots P(A,\, B_{i_\nu}) \tag{5} $$

The sequence without aftereffect can, therefore, be characterized by the
statement that it satisfies the special theorem of multiplication with respect
to the succession of its elements. This is true, however, only for a certain
kind of enumeration. Probabilities like those occurring on the left side of (1)
or (5) refer, in the frequency interpretation, to a counting operation in which
the subscript i always progresses by one unit, so that overlapping combinations
are counted. A group B B B then contributes two combinations, which
may be indicated by the schema B B B; the sequence (1, § 27), for instance,
would contain 21 combinations. This method of counting is called an enumeration
by overlapping segments or, to use a shorter name, an enumeration with
overlapping. The sequence (1, § 27) satisfies the special multiplication theorem
for this type of counting; this can be recognized for the combinations B B 
from the fact that the probability of such a combination assumes the value 
fj, which is nearly =* £ • £. 
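Enumeration with overlapping can be sketched as a sliding window that advances by one place at a time. The function name `overlapping_frequency` is an assumption of this sketch, as is the choice to divide by the number of complete windows in a finite string; the test strings are hypothetical.

```python
from fractions import Fraction

def overlapping_frequency(seq, block):
    """Frequency of `block` when the window advances by one element at a time,
    so that overlapping combinations are counted."""
    k = len(block)
    windows = len(seq) - k + 1  # number of complete windows in the finite string
    hits = sum(1 for i in range(windows) if seq[i:i + k] == block)
    return Fraction(hits, windows)

# A group B B B contributes two overlapping combinations B B.
print(overlapping_frequency("BBB", "BB"))   # 1
print(overlapping_frequency("BBSB", "BB"))  # 1/3
```

For a sequence free from aftereffect, this frequency for the block B B should approach the product of the single frequencies of B as the sequence grows.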

The next section introduces a different method of counting.

§ 30. Normal Sequences

It was pointed out above that the property of being free from aftereffect is
only a necessary, but not a sufficient, condition for the definition of the normal
sequence that we wish to construct. In order to state the sufficient conditions
for the normal sequence, another means of structural characterization must
be added to the use of phase probabilities. The second method consists in
the use of selections.

By a selection S we understand a rule that determines for every element
$y_i$ whether $y_i \in S$ or $y_i \in \bar{S}$ holds. Any probability sequence in which S is an
attribute of the elements $y_i$ can be regarded as a selection; it can be used to
select from another sequence those elements $x_i$ for which $y_i$ belongs to S.
But the concept of selection is somewhat more general, for we do not require
a limit of the frequency for S. Thus the probability P(A,S) need not exist.
The selection may be given by an arithmetical rule (for instance, in the form,
“every third element $y_i$”) or, as explained, by a probability sequence. In
principle, however, we can imagine every selection to be given in the form
of an infinite schema, which states for every $y_i$ whether it belongs to S.

The method of characterizing the structural order of probability sequences 
by reference to subsequences is thus extended to the use of subsequences 
defined by any form of selection. The probability within a subsequence 
selected by S will be written 

$$ P(A\,.\,S,\, B) $$

The use of general selections can be combined with the use of phase probabilities;
thus in

$$ P(A\,.\,S\,.\,B,\, B^1) $$

the probability of B is considered within the subsequence determined by the
rule that the predecessor of the selected element belongs to S and has the
attribute B.

There will always be selections that leave the probability of B invariant,
and others that change the probability of B. Let S be the selection, “every
third element $y_i$”. This selection, applied to a sequence obtained by throwing
a die, will leave the probability of B invariant; we shall have, for instance,

$$ P(A\,.\,S,\, B) = P(A,\, B) \qquad P(A\,.\,S\,.\,B,\, B^1) = P(A\,.\,B,\, B^1) $$

$$ P(A\,.\,S^1\,.\,B,\, B^1) = P(A\,.\,B,\, B^1) $$

A selection changing the probability of B is represented by the selection,
“every element B of the major sequence and its successor”; in this subsequence,
B has a frequency higher than in the major sequence. Using, instead
of the disjunction $B \lor \bar{B}$, the complete and exclusive disjunction $B_1 \lor \ldots \lor B_r$,
we introduce the definition:

A selection S belongs to the domain of invariance of the probability sequence B
if it leaves the probability of $B_i$ and, simultaneously, all the phase probabilities
of $B_i$ invariant for all i, whereas S may occur in any phase; that is, if

$$ P(A\,.\,S,\, B_i) = P(A,\, B_i) $$

$$ P(A\,.\,S^{\sigma}\,.\,B^{1}_{i_1} \ldots B^{\nu-1}_{i_{\nu-1}},\, B^{\nu}_{i_\nu}) = P(A\,.\,B^{1}_{i_1} \ldots B^{\nu-1}_{i_{\nu-1}},\, B^{\nu}_{i_\nu}) \tag{1} $$

$$ 1 \leq \sigma \leq \nu $$
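The two selections mentioned above, one that leaves the frequency of B unchanged and one that raises it, can be tried on simulated data. This is a sketch under stated assumptions: the function names are hypothetical, and the major sequence is produced by a seeded pseudo-random generator standing in for even and odd die throws, not taken from the text.

```python
import random

def freq_B(seq):
    """Relative frequency of B in a string of the letters B and S."""
    return seq.count("B") / len(seq)

def select_every_third(seq):
    """Selection 'every third element': y_3, y_6, y_9, ... (1-based)."""
    return seq[2::3]

def select_B_and_successor(seq):
    """Selection 'every element B of the major sequence and its successor'."""
    keep = set()
    for i, y in enumerate(seq):
        if y == "B":
            keep.add(i)
            if i + 1 < len(seq):
                keep.add(i + 1)
    return "".join(seq[i] for i in sorted(keep))

rng = random.Random(0)  # fixed seed; simulated stand-in for die throws
major = "".join(rng.choice("BS") for _ in range(30000))

print(round(freq_B(major), 3))
print(round(freq_B(select_every_third(major)), 3))
print(round(freq_B(select_B_and_successor(major)), 3))
```

For an idealized fair and independent sequence one would expect the first two values to lie near 1/2 and the third near 2/3, so that the second selection falls outside the domain of invariance.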


A certain narrower class, the class of regular divisions, plays an important
part among the selections. By a regular division S we understand a division
of the major sequence into $\lambda$ subsequences $S_{\lambda 1}, S_{\lambda 2}, \ldots, S_{\lambda\lambda}$, such that to a
specified $S_{\lambda\kappa}$ belong all the elements $y_i$ for which

$$ i = \kappa + (m - 1) \cdot \lambda \qquad m = 1, 2, 3, \ldots \qquad \kappa = 1, \text{ or } = 2, \text{ or } \ldots, \text{ or } = \lambda \tag{2} $$

Thus the sequence $y_2, y_6, y_{10}, y_{14}, \ldots$ is determined by the selection resulting
for $\lambda = 4$ and $\kappa = 2$. In (2), m may be interpreted as the running subscript
in the “self-numeration” of the subsequence $S_{\lambda\kappa}$, that is, a numeration
in which a series of consecutive numbers is assigned to the elements of the 
subsequence. If the regular divisions belong to the domain of invariance of a 
sequence, we speak of a regular domain of invariance and call the sequence 
regular-invariant. By means of these concepts the desired definition can now 
be given: 

A sequence B is a normal sequence if it is free from aftereffect and if the 
regular divisions belong to its domain of invariance. 
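A regular division is easy to state operationally. In the sketch below the function names `regular_division_indices` and `regular_division` are hypothetical; subscripts are 1-based as in the text, and a finite string stands in for the infinite sequence.

```python
def regular_division_indices(n, lam, kappa):
    """1-based subscripts i = kappa + (m - 1) * lam that do not exceed n."""
    if not 1 <= kappa <= lam:
        raise ValueError("kappa must satisfy 1 <= kappa <= lam")
    return list(range(kappa, n + 1, lam))

def regular_division(seq, lam, kappa):
    """Elements of `seq` belonging to the kappa-th of lam subsequences."""
    if not 1 <= kappa <= lam:
        raise ValueError("kappa must satisfy 1 <= kappa <= lam")
    return seq[kappa - 1::lam]  # shift by one for Python's 0-based indexing

print(regular_division_indices(16, 4, 2))  # [2, 6, 10, 14]
```

Testing a finite sample for regular invariance then amounts to comparing the frequency of B in each of the subsequences with the frequency in the whole sample, and likewise for the phase frequencies.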

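Therefore we have, for a normal sequence,

$$ P(A\,.\,S_{\lambda\kappa},\, B_i) = P(A,\, B_i) $$

$$ P(A\,.\,S^{\sigma}_{\lambda\kappa}\,.\,B^{1}_{i_1} \ldots B^{\nu-1}_{i_{\nu-1}},\, B^{\nu}_{i_\nu}) = P(A\,.\,B^{1}_{i_1} \ldots B^{\nu-1}_{i_{\nu-1}},\, B^{\nu}_{i_\nu}) = P(A,\, B_{i_\nu}) \tag{3} $$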


To show the consequences of this definition, we shall investigate the conclusions
that can be drawn from (3). When, in the second equation, we use
the special values $\nu = \lambda$, $\sigma = 1$, and $\kappa = 1$, and then apply the first of the
equations with $\kappa = \lambda$, we obtain

$$ P(A\,.\,S^{1}_{\lambda 1}\,.\,B^{1}_{i_1} \ldots B^{\lambda-1}_{i_{\lambda-1}},\, B^{\lambda}_{i_\lambda}) = P(A\,.\,S^{\lambda}_{\lambda\lambda},\, B^{\lambda}_{i_\lambda}) \tag{4} $$

The probability on the left would read, in the implicational mode of writing,

$$ (i)\,([x_i \in A]\,.\,[y_{i+1} \in S_{\lambda 1}]\,.\,[y_{i+1} \in B_{i_1}] \ldots [y_{i+\lambda-1} \in B_{i_{\lambda-1}}] \;\underset{p}{\ni}\; [y_{i+\lambda} \in B_{i_\lambda}]) \tag{5} $$

Now it is obvious that when the element $y_{i+1}$ belongs to $S_{\lambda 1}$, the element $y_{i+2}$
will belong to $S_{\lambda 2}$, and so on, so that the element $y_{i+\lambda}$ will belong to $S_{\lambda\lambda}$. The
condition stated by the compound first term in (5) can therefore be satisfied
only when the elements $y_{i+1}, \ldots, y_{i+\lambda-1}, y_{i+\lambda}$ belong, respectively, to the
first, ..., $(\lambda - 1)$-th, $\lambda$-th subsequence selected by the regular division.
This means that we count in (5) only the elements $y_{i+\lambda}$ that belong to the
last of the subsequences. A further restriction is that we count only the elements
$y_{i+\lambda}$ of this subsequence for which the corresponding elements of the
other subsequences have, respectively, the attributes $B_{i_1} \ldots B_{i_{\lambda-1}}$. The relation
(4), therefore, states that the probability of $B_{i_\lambda}$ in the $\lambda$-th subsequence,
given by the term on the right, is independent of the attributes occurring in
the other subsequences. In other words, (4) expresses the condition that the
subsequences selected by regular divisions are independent of one another.

This result can also be proved by the fact that (3) leads to the special 
theorem of multiplication in a form, however, that is somewhat different 
from the one used in (5, §29). It was pointed out in §29 that the latter 
equation is valid for enumeration with overlapping. But the combinations 
can be counted in a different way, by counting consecutive sections of the
length $\lambda$ that do not overlap. If we thus section off in (1, § 27) by separating
strokes after every two elements, we obtain 40 sections of two elements each;
among these we find 8 sections of the form B B. Here again the probability
is close to the one we require, namely, is = 8/40 and thus is nearly = 1/2 · 1/2.
This is a different kind of counting, however, which is called enumeration by
sections.


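The enumeration by sections is formulated by means of the regular divisions.
When we write

$$ P(A\,.\,S^{1}_{\lambda\kappa},\, B^{1}_{i_1} \ldots B^{\lambda}_{i_\lambda}) \tag{6} $$

this means that we count the combinations $B_{i_1} \ldots B_{i_\lambda}$ only for the case that
the element $y_{i+1}$ belongs to $S_{\lambda\kappa}$. Since the denominator of this expression in
the frequency interpretation has the form

$$ N_{i=1}^{n}\,(x_i \in A)\,.\,(y_{i+1} \in S_{\lambda\kappa}) \tag{7} $$

it counts directly the number of sections of length $\lambda$. Thus the combinations
B B in (1, § 27) will be counted by the expression

$$ P(A\,.\,S^{1}_{21},\, B^{1}\,.\,B^{2}) \tag{8} $$

Similarly, we shall count by the expression $P(A\,.\,S^{1}_{22},\, B^{1}\,.\,B^{2})$ the sections B B
resulting after the cancellation of the first element of the sequence. By the
use of the general term $S_{\lambda\kappa}$ in (6), instead of $S_{\lambda 1}$, we thus can express the
possibility that the sectioning starts after cancellation of the first $\kappa - 1$ elements
of the sequence, for any value $\kappa$.

It is now easy to show that, as a consequence of (3), a probability of the
form (6) satisfies the special theorem of multiplication. Using the general
theorem of multiplication, we have:

$$ \begin{aligned} P(A\,.\,S^{1}_{\lambda\kappa},\, B^{1}_{i_1} \ldots B^{\lambda}_{i_\lambda}) &= P(A\,.\,S^{1}_{\lambda\kappa},\, B^{1}_{i_1} \ldots B^{\lambda-1}_{i_{\lambda-1}}) \\ &\quad \cdot P(A\,.\,S^{1}_{\lambda\kappa}\,.\,B^{1}_{i_1} \ldots B^{\lambda-1}_{i_{\lambda-1}},\, B^{\lambda}_{i_\lambda}) \\ &= P(A\,.\,S^{1}_{\lambda\kappa},\, B^{1}_{i_1} \ldots B^{\lambda-1}_{i_{\lambda-1}}) \cdot P(A,\, B_{i_\lambda}) \end{aligned} \tag{9} $$

$$ 1 \leq \kappa \leq \lambda $$

Here we have applied (3) to the term on the second line. Repeating the
procedure, we obtain

$$ \begin{aligned} P(A\,.\,S^{1}_{\lambda\kappa},\, B^{1}_{i_1} \ldots B^{\lambda-1}_{i_{\lambda-1}}) &= P(A\,.\,S^{1}_{\lambda\kappa},\, B^{1}_{i_1} \ldots B^{\lambda-2}_{i_{\lambda-2}}) \\ &\quad \cdot P(A\,.\,S^{1}_{\lambda\kappa}\,.\,B^{1}_{i_1} \ldots B^{\lambda-2}_{i_{\lambda-2}},\, B^{\lambda-1}_{i_{\lambda-1}}) \\ &= P(A\,.\,S^{1}_{\lambda\kappa},\, B^{1}_{i_1} \ldots B^{\lambda-2}_{i_{\lambda-2}}) \cdot P(A,\, B_{i_{\lambda-1}}) \end{aligned} \tag{10} $$

The last step, once more, is covered by (3). By further iteration of the procedure
we arrive at the result

$$ P(A\,.\,S^{1}_{\lambda\kappa},\, B^{1}_{i_1} \ldots B^{\lambda}_{i_\lambda}) = P(A,\, B_{i_1}) \cdot \ldots \cdot P(A,\, B_{i_\lambda}) \tag{11} $$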

Thus we have shown that normal sequences satisfy the special theorem of 
multiplication for counting by sections. This result goes beyond (5, § 29) 
because the latter relation refers only to counting with overlapping. That 
normal sequences satisfy also (5, § 29) follows because the definition of these 
sequences includes the condition that they be free from aftereffect. 

The normal sequence was defined, not by specifying its total domain of 
invariance, but by stipulating only a certain condition for the domain of 
invariance. There is an important reason for such a procedure: it is virtually 
impossible to specify completely the domain of invariance of a sequence. The 
normal sequences dealt with in practical applications possess, in general, a 
domain of invariance wider than that stipulated here. But whether a certain 
selection S belongs to the domain of invariance of a sequence is a question 
that must be investigated for each selection separately. The answer to the 
question is often an important task of empirical science. 

For example, we form the sequence of adult male inhabitants (case A)
of a metropolis, say, according to the order in which their names are given
in the city directory (the alphabet then defines the subscript sequence); and
we note whether or not the person suffers from tuberculosis (case B). Then a
normal sequence is obtained for B, since here a selection by predecessor or
by regular division does not result in a different frequency. Should we make
a selection, however, by considering only the inhabitants of certain quarters
of poor housing conditions, we would obtain a higher frequency for B. In
this case the major sequence contains a different selection; but it will
correspond, not only to the definition (3), but to the ordinary linguistic usage if
we nevertheless call it a normal sequence.

A sequence is completely and individually characterized only by stating 
its domain of invariance; but whether it is a normal sequence must be determined
on the basis of a general definition. For this reason, the definition of
the normal sequence is given in terms of a minimum condition for the domain 
of invariance. The regular divisions were chosen for this purpose because they 
are connected with the validity of the special theorem of multiplication in 
enumeration by sections. Moreover, the domain of regular divisions can be 
applied usefully to some nonnormal sequences also. 

This definition of the normal sequence corresponds to one given by A. H.
Copeland,¹ who defines the normal sequence by the postulate of the invariance
of the limit of the frequency in subsequences selected by regular divisions,
with the use of an additional postulate requiring the complete mutual
independence of the subsequences.² It can easily be shown that Copeland's
definition is equivalent to the one given above, for which the independence of
the subsequences is a derivable theorem, whereas the invariance condition is
extended to include phase probabilities. Furthermore, Copeland has been
able to show that it is possible to construct normal sequences by means of
mathematical rules, a highly satisfactory result, which gives the theory of
normal sequences a secure position within the mathematics of infinite
sequences.

The various forms of definitions of normal sequences were preceded by a 
definition of random sequences established by R. von Mises. His theory was 
the first to characterize the structural order of sequences by means of postulates
concerning the limit of the frequency in subsequences. The specific
postulate used, however, differs from the one used for normal sequences in 
that it establishes much stronger requirements. 

The theory developed by von Mises can be summarized as follows. The
notion of place selection is defined as a selection by any rule that does not
make use of the attribute of the element selected, though reference may be
made to attributes of other elements. The regular divisions are examples of
place selections. Other illustrations are given by the selections, “every element
the subscript of which is a prime number”, or “every element that has an
element B as predecessor”. But the selection, “every element that has the
attribute B”, is not a place selection, because it makes use of the attribute B
of the element selected.

¹ “Admissible Numbers in the Theory of Probability,” in Amer. Jour. Math., Vol. 50, No.
4 (1928), p. 535. Instead of my term “normal sequences”, Copeland uses the term
“admissible numbers”. He uses as attributes the numbers 1 and 0 instead of B and $\bar{B}$, and thus
regards every sequence as a dyadic fraction defining a number. Copeland’s publication
precedes my own first publication of my definition of normal sequences (1932) by several years;
but it was unknown to me at that time. I am glad to recognize his priority with respect to
the definition of normal sequences.

² I used this definition in my first publication on this subject: Math. Zs., Vol. 34 (1932),
p. 603. I prefer the definition given in this book (and also in the German original of this book)
because it can be extended to the case of certain nonnormal sequences (see § 33).
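
The place selections named above can be sketched as small Python functions. This is a minimal illustration, not a formulation taken from von Mises: the function names and the toy sequence are hypothetical, subscripts are counted from 1, and `regular_division(seq, lam, kappa)` assumes that the regular division picks the elements with subscripts κ, κ + λ, κ + 2λ, and so on. Each rule decides on an element without looking at that element's own attribute.

```python
import math


def is_prime(n):
    """True if n is a prime number."""
    return n >= 2 and all(n % d for d in range(2, math.isqrt(n) + 1))


def regular_division(seq, lam, kappa):
    """Every lam-th element, starting with the element of subscript kappa."""
    return [x for i, x in enumerate(seq, start=1)
            if i >= kappa and (i - kappa) % lam == 0]


def prime_subscripts(seq):
    """Every element the subscript of which is a prime number."""
    return [x for i, x in enumerate(seq, start=1) if is_prime(i)]


def after_predecessor(seq, attribute):
    """Every element whose immediate predecessor has the given attribute."""
    return [seq[i] for i in range(1, len(seq)) if seq[i - 1] == attribute]


# Hypothetical toy sequence; N stands for "non-B".
seq = list("BNBBNNBNBB")
by_division = regular_division(seq, 2, 1)
by_primes = prime_subscripts(seq)
by_predecessor = after_predecessor(seq, "B")
print(by_division, by_primes, by_predecessor)
```

A rule such as `[x for x in seq if x == "B"]` would not qualify, since it inspects the attribute of the very element it selects.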

Von Mises now introduces the postulate that any place selection that is 
definable by a rule belongs to the domain of invariance. This postulate is 
called the principle of randomness; a sequence thus defined is called a collective. 
He further shows that a collective defined in this manner can never be given 
by a mathematical rule. He intends thus to formulate the peculiar feature of 
randomness that is exhibited by sequences produced by events in nature. In 
particular he wishes to express the fact that it is impossible to construct a 
gambling system, that is, a method of selection by which a gambler will, on 
the whole, gain money.³

The definition of randomness given by von Mises has been widely discussed. 
In particular, the objection was raised whether his principle of randomness 
is free from contradictions. It is questionable whether the phrase, “a place 
selection that can be defined by a rule”, determines a unique class of selections.
The class of selections so determined will depend on the language used;
what is undefinable for one language may be definable for another. 

In order to overcome this difficulty, two different ways of defining randomness
can be used. The first approach is to restrict the class of selections to a
well-defined class; this method leads to a restricted randomness. The plan of
defining a logical randomness, that is, of identifying randomness with the
impossibility of a linguistic formulation of deviating selections, is then
abandoned. A restricted randomness is defined in terms of a restricted domain of
invariance. Of this kind are the definitions of normal sequences given by
A. H. Copeland and the author; the theories developed by K. Dörge,⁴ A. Wald,⁵
and Jean Ville⁶ belong in the same group. The latter theories differ from the
former in that the domain of invariance is defined differently. Thus Dörge
used a definition that begins with a certain set of selections and then includes
in the domain of invariance all iterations of these selections, that is, all
selections that can be constructed by any combination of a number of these
selections. Wald introduced the notion of a collective relative to a certain
class of selections, and then investigated the class of sequences that can
possibly satisfy such a definition. His results are particularly interesting when
sequences with a continuous attribute space (see § 41) are used. The
investigations of Ville concerned, in particular, the question whether it is possible
to construct a gambling system for a collective as defined by Wald. Ville’s
findings were that the exclusion of a gambling system cannot be postulated
universally, but must be formulated with respect to a class of systems of wagering.

³ The first publication by von Mises of his theory was made in Math. Zs., Vol. V (1919),
p. 57. In this publication he stipulated the further condition that other collectives (that is,
other sequences produced by nature, if taken as selections) should belong to the domain of
invariance of the sequence. This condition was dropped later (see his Vorlesungen aus dem
Gebiete der angewandten Mathematik, Vol. I: Wahrscheinlichkeitsrechnung . . . [Leipzig, 1931],
p. 12), and a differentiation between dependent and independent collectives (ibid., p. 93)
was carried through. My objections to the principle of randomness (“Axiomatik der
Wahrscheinlichkeitsrechnung,” in Math. Zs., Vol. 34 [1932], p. 594) were, in part, directed
against this earlier formulation, but they also expressed the criticism given here. It may be
emphasized again that I consider the foundation of the probability calculus given by von
Mises as a great advance in the construction of the calculus. Only the principle of
randomness seems to me to be a condition that requires revision. (See the remarks below.)

⁴ Math. Zs., Vol. 32 (1930), p. 232.

⁵ Die Widerspruchsfreiheit des Kollektivbegriffs, Ergebnisse eines mathematischen
Kolloquiums (Vienna), Vol. VIII (1937).

⁶ “Étude critique de la notion de collectif,” in Monographies des probabilités (ed. by Émile
Borel; Paris, 1939).

The method developed by Copeland and the author differs from all the 
others in that it uses only certain minimum conditions for the domain of 
invariance and leaves it open whether further kinds of selections are to be 
included. This method has the advantage that, when a certain theorem is to 
be derived, no more presuppositions are used than are necessary for the 
derivation. Thus, when the validity of the special theorem of multiplication 
in enumeration by sections is to be proved, it is irrelevant whether the sequences
possess more invariant selections than those specified for a regular
domain of invariance. For instance, it is irrelevant whether the subsequence
of elements the place number of which is a prime number has a different limit
of the frequency. 

The second approach takes up the challenge of defining a logical randomness,
but attempts to avoid the difficulties of the original definition given by
von Mises. In this group belong certain remarks made by Wald.⁷ In particular,
however, the definitions developed by Alonzo Church⁸ must be mentioned.
Using his notion of effective calculability, developed in continuation of
Gödel’s theorem about certain limitations of deducibility,⁹ Church arrived
at a definition of a random sequence that may perhaps be regarded as the
solution of the problem, that is, as a definition of randomness in terms of
pure logic. It is certainly a remarkable result that such a definition can be
given. The complexity of the subject makes it impossible to explain here the
methods used by Church, which refer to some of the most intricate developments
of symbolic logic.

The significance of randomness definitions of the latter kind is chiefly in
the realm of logic and mathematics. So far as the applied calculus of probability
is concerned, all requirements can be satisfied by simpler methods.
The methods used in the definition of normal sequence are sufficient as a
basis for the mathematical treatment of randomness used in practical
applications. If the term randomness is to include further requirements, they can
be stated by the use of selections of physical reference, that is, selections not
defined by mathematical rules, but by reference to physical (or psychological)
occurrences. For practical statistics it is as important to know which physical
selections belong to the domain of invariance as it is to know which mathematical
selections are contained in this domain. If a sequence possesses randomness
of the von Mises-Church type, there may still be physical selections
that lead to a deviating frequency. This may be illustrated by the previous
example of tuberculosis in a metropolis and a selection in terms of living
quarters.

⁷ A. Wald, op. cit., p. 47.

⁸ Bull. Amer. Math. Soc., Vol. 46 (1940), p. 130.

⁹ Compare the presentation of this theorem and of Church’s methods in D. Hilbert and P.
Bernays, Grundlagen der Mathematik (Berlin, 1939), Vol. II, § 5, and Supplement II.

Now it is possible to formulate an equivalent of von Mises’ condition of 
randomness in terms of physical reference, instead of the use of logical methods.
Random sequences are characterized by the peculiarity that a person
who does not know the attributes of the elements is unable to construct a 
mathematical selection by which he would, on an average, select more hits 
than would correspond to the frequency of the major sequence. In other 
words, such selections will be included in the domain of invariance. In this 
form, the impossibility of making a deviating selection is expressed by a 
psychological, not a logical, statement; it refers to acts performed by a human 
being. This may be called a psychological randomness. 

For all practical purposes the psychological definition of randomness is 
sufficient. It has the advantage of being free from problems of consistency, 
and does not connect the calculus of probability with controversial problems 
of the theory of logical deduction. It is true that a formulation of this kind, 
instead of speaking of logical impossibilities, refers only to a limitation of 
the technical abilities of human observers. But such a psychological reference 
is indispensable, too, when selections in terms of physical observations are 
to be incorporated in the domain of invariance. We say, for instance, that 
observations of the initial velocity of the spinning roulette ball do not enable 
us to make a deviating selection of the results and that thus a selection based 
on such observations belongs in the domain of invariance of the sequence. 
But this is true only in view of the limited abilities of human observers, as
far as both observation and mathematical computation are concerned. With 
precise observation of the initial velocity of the spinning ball, it should be 
possible to foretell with any degree of exactness where it will come to rest. 
It is only lack of technical ability that prevents us from equaling Laplace’s 
superman. Once the velocity of the spinning ball has died down noticeably, 
such a computation comes into the scope of technical possibilities. This is 
well known to the owners of gambling places, who stop such attempts by the 
croupier’s call, “Rien ne va plus”. 

The significance of the problem of the definition of random sequences
should not be overestimated, however. Within the general calculus of probability,
random sequences represent merely a special type; as in all other
cases, the definition of special types is more or less arbitrary and is subject
only to considerations of practical use. In actual applications, all kinds of
probability sequences are encountered. Some show the features of randomness;
others represent intermediate types between strictly ordered and random
sequences. In the following sections, such nonnormal sequences will be
considered. It would constitute a rather narrow conception of probability if the
name of probability sequences were reserved for random sequences, a conception
that certainly would contradict language usage. For instance, the
sequence of the daily number of traffic accidents in a city will not represent
a normal sequence because the seven-day period of Sundays will furnish a
subsequence of a different frequency; nonetheless, this sequence will be
regarded as a probability sequence. All types of probability sequences are
found in nature. A mathematical theory of probability should not be restricted
to the study of one specific type of sequence but should include suitable
definitions of various types, chosen from the standpoint of practical use.

It is an important fact that all these special types can be dealt with by 
the same general system of axioms that was presented in chapter 3 and § 28. 
The specific conditions of a certain type can be formulated as equalities referring
to certain probabilities of subsequences. All that we need, therefore, for
the treatment of special types is a knowledge of whether certain probabilities 
are equal; the rest is derivable by the general rules of the calculus. If it is 
possible to ascertain the values of the probabilities of sequences, it will be 
possible also to know whether two probabilities are equal. The determination 
of the specific types of probability sequences occurring in practical applications
can therefore be carried through by the same methods as those that
allow us to determine whether a sequence is a probability sequence at all. 

§ 31. Some Numerical Problems Referring to 
Normal Sequences 

The characterization of normal sequences given above, though simple methods
are used throughout, may appear rather complicated when regarded
abstractly in its symbolic notation. A few additional considerations and examples
will serve to elucidate the bearing of the special theorem of multiplication
on the structure of the sequence.

We shall examine the question how often a die must be thrown so that the
occurrence of face 6 in at least 1 of the throws can be expected with the
probability $\frac{1}{2}$. We may use here the relation (5, § 20). We regard the
occurrence of the “6” as the event B. Because of the logical equivalence

$$(B^1 \lor \ldots \lor B^n \equiv \overline{\bar{B}^1 \ldots \bar{B}^n}) \tag{1}$$

and with the use of the special theorem of multiplication, the desired probability
obtains in the form

$$P(A, B^1 \lor \ldots \lor B^n) = 1 - P(A, \bar{B}^1 \ldots \bar{B}^n) = 1 - \left(1 - \tfrac{1}{6}\right)^n \tag{2}$$

If this probability is to be $= \frac{1}{2}$, we have the equation

$$\tfrac{1}{2} = 1 - \left(1 - \tfrac{1}{6}\right)^n \tag{3}$$

which, when solved for n, gives the value n = 3.8. Since n can be only an
integer, this result means that the occurrence of face 6 cannot be expected
for 3 throws with a probability $\frac{1}{2}$; but for 4 throws it can be expected with
a probability greater than $\frac{1}{2}$. We can ask the same question for the case that,
throwing with two dice, we obtain doublets, for example, the combination
6.6; we then substitute in (3) $\frac{1}{36}$ for $\frac{1}{6}$ and obtain, solving for n, the value
n = 24.6. Only after at least 25 throws, therefore, can we expect¹ the double
“6” with a probability greater than $\frac{1}{2}$.
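
The two numerical values can be checked with a short Python sketch that solves equation (3) for n by logarithms. The function names are hypothetical; the only ingredients are relation (2) and the single-throw probabilities 1/6 and 1/36.

```python
import math


def p_at_least_once(p_single, n):
    """Relation (2): probability of at least one occurrence in n throws."""
    return 1 - (1 - p_single) ** n


def throws_needed(p_single, target=0.5):
    """Solve target = 1 - (1 - p_single)**n for real n, and also return the
    smallest whole number of throws whose probability exceeds the target."""
    n_real = math.log(1 - target) / math.log(1 - p_single)
    n_whole = math.floor(n_real) + 1
    return n_real, n_whole


one_die = throws_needed(1 / 6)
two_dice = throws_needed(1 / 36)
print(round(one_die[0], 1), one_die[1])
print(round(two_dice[0], 1), two_dice[1])
```

For a single die, three throws give a probability of about 0.42 and four throws about 0.52, which is why the non-integer solution has to be rounded upward.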

Jacob Bernoulli showed that these numerical values lead to a paradox
when treated without due caution.² Imagine a series of 600 throws carried
out with one die; divide the series into sections of 4 throws each. The value
3.8 can be replaced approximately by 4, and thus we may expect a “6” in
every second group. There are 150 groups, and therefore we should find
face 6 about 75 times. This contradicts another consideration: since the
probability of obtaining face 6 is $\frac{1}{6}$, we should expect “6” 100 times within 600
throws. How can the contradiction be explained?

The error does not result from the approximative replacement of 3.8 by 4;
the small inaccuracy cannot account for so great a deviation. It lies rather
in the careless manipulation of the inclusive “or”. What follows from (3) is
that among 4 throws at least one “6” is to be expected with the probability
$\frac{1}{2}$. Therefore, on the average, a “6” will occur in every second section; but
some sections will have several values “6”, so that altogether there are 100
results of “6”. This makes clear a peculiar consequence to which the special
theorem of multiplication leads, so far as the structural order of normal
sequences is concerned. The theorem requires that a certain clustering of the
results must occur. When the series is divided into sections of 4, there will
be, on the whole, no “6” in every second section; but in the other sections
the “6” will be found accumulated. Should we distribute the 100 values “6”
artificially in an even density over the 600 throws, we would not obtain the
normal frequency distribution. The random distribution differs from a
distribution by artificial equalization; it gives not only the statistical frequency
required for the major probability, but, simultaneously, those required for the
many minor probabilities that hold for the possible combinations of results.
Persons not acquainted with mathematics often fail to understand this
feature of probability sequences and are astonished at the clustering that
occurs. The phrase, “Like attracts like”, was coined in order to account for
runs of pleasant or unpleasant happenings. If a person not trained in the
theory of probability were asked to construct artificially a series of events
that seems to him to be well shuffled, there would not be enough runs in it.
(It would be a sequence with subnormal dispersion; see § 52.) Only if the
sequence is not normal can the clustering be stronger than is compatible with
the special theorem of multiplication; tests of the normal character of a
sequence are, in fact, often based on the counting of clustering. Normal
sequences are characterized by normal clustering. Here the clusters are not
very long, and the clustering disappears when larger sections are considered.

¹ This example played an important role in the history of the probability calculus. It
represents the question that the Chevalier de Méré asked Pascal and that stimulated this
mathematician to give the first scientific treatment of the probability calculus.

² Ars conjectandi (Basel, 1713), Part 1, chap. x, p. 29.
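
The resolution of Bernoulli's paradox can be made concrete with exact fractions. The sketch below assumes only what the text states (600 throws, sections of 4, probability 1/6 per throw, independence of the throws); the variable names are hypothetical.

```python
from fractions import Fraction

p_six = Fraction(1, 6)
section_length = 4
number_of_sections = 150          # 600 throws divided into sections of 4

# Probability that a section contains at least one "6", by relation (2).
p_section_has_six = 1 - (1 - p_six) ** section_length

# Expected number of sections with at least one "6" (roughly every second one).
expected_sections_with_six = number_of_sections * p_section_has_six

# Expected total number of results "6" in 600 throws.
expected_sixes = number_of_sections * section_length * p_six

# Average number of "6" within those sections that contain any at all.
sixes_per_occupied_section = expected_sixes / expected_sections_with_six

print(p_section_has_six, float(expected_sections_with_six))
print(expected_sixes, float(sixes_per_occupied_section))
```

The total of 100 is reached because the occupied sections hold, on the average, more than one “6” each; this surplus is the clustering described above.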

In this connection should be mentioned another paradox to which careless
inferences may lead. According to the multiplication theorem, long runs,
that is, consecutive occurrences of the same event, are comparatively rare.
The probability of a series of 5 results of “6” is equal to $(\frac{1}{6})^5 = \frac{1}{7776}$, which
is a very small number. If such a series once occurs, many people are inclined
to believe that the occurrence of a further “6” is now more improbable than
in other cases. They argue that the appearance of another “6” would produce
the very improbable case of a series of 6 results of “6”, to which only the
probability $(\frac{1}{6})^6 = \frac{1}{46656}$ can be assigned. But this statement contradicts
another consideration. In every throw, the probability of the “6” must be $\frac{1}{6}$
because the preceding throws cannot influence the later ones. The paradox
is resolved by recognizing that, for instance, the occurrence of a “5” in the
sixth throw makes the present 5 values “6” into a sequence 6-6-6-6-6-5, the
probability of which is likewise only $(\frac{1}{6})^6$. Thus it is clear that after a series of
5 values “6” there are no different conditions for the “6” than for any other
number. Therefore the statement is correct that even in this case the original
probability $\frac{1}{6}$ will hold for the occurrence of the “6”. Every series that
actually occurs is extraordinarily improbable, since its probability is calculated
as the product of factors smaller than 1. If a series of 6 values “6” is regarded
as particularly improbable, this will be correct when it is compared, not to
any definite different sequence, but to the case of any other sequence, that is,
to an or-combination of other sequences. But if 5 values “6” once have
occurred, a great number of possible cases is therewith excluded; and there
remain only 6 cases given by the occurrence of a 1, or 2, . . . or 6 for the sixth
throw. Thus the situation is exactly the same as in the first throw.
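
The argument can be verified by brute-force enumeration of all series of 6 throws. The following Python sketch assumes equally probable and independent faces, as the text does; the variable names are hypothetical.

```python
from fractions import Fraction
from itertools import product

p_face = Fraction(1, 6)

# Probabilities of two definite series of six throws.
p_six_sixes = p_face ** 6                  # 6-6-6-6-6-6
p_five_sixes_then_five = p_face ** 6       # 6-6-6-6-6-5, likewise six factors 1/6

# All 6**6 equally probable series of six throws.
all_series = list(product(range(1, 7), repeat=6))

# Series that begin with five results "6": only six cases remain.
begin_with_run = [s for s in all_series if s[:5] == (6, 6, 6, 6, 6)]

# Relative share of these in which the sixth throw is again a "6".
p_six_after_run = Fraction(
    sum(1 for s in begin_with_run if s[5] == 6), len(begin_with_run)
)

print(p_six_sixes, len(begin_with_run), p_six_after_run)
```

The series 6-6-6-6-6-6 is improbable only when it is set against the disjunction of all other series; against each single competitor, such as 6-6-6-6-6-5, it is exactly as probable.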

Apparently it is not easy to acquiesce in this conclusion. At the “green 
table” it can frequently be observed that, after the occurrence of a long run, 
the gambler is inclined to believe in the prospect of a change rather than in
the continuation of the run. It requires some theoretical training not to submit
to the suggestive power of this fallacy. The case may be compared to
optical illusions, which we know to be misleading without being able to free 
ourselves from the compulsion of the optical image. 

§ 32. Mutually Dependent Normal Sequences 

The definition of the normal sequence in § 30 is given in terms of an internal 
property of the sequence; it states nothing about the relation of the normal 
sequence to other sequences. Thus two normal sequences may be mutually 
dependent. In this case the special theorem of multiplication holds with 
respect to the succession of elements within each sequence, but not for the 
combination of the two sequences. This makes clear that the special theorem 
need not hold true for all combinatory properties of sequences, but that there 
will be cases in which the special theorem is used for certain operations, 
whereas the general theorem of multiplication is used for others. 

A trivial example for the case considered is given by two normal sequences
that are exactly equal. If, in throwing with two dice, the same face should
always appear on both dice, each of the sequences could be normal, nevertheless.
Among the combinations, however, doublets would have the probability
$\frac{1}{6}$, but the combination of unequal faces would have the probability 0,
a result that obviously contradicts the special theorem of multiplication.
This extreme case is of little interest because it does not occur in actual practice.
A different example will therefore be supplied. The explanation of the
physical mechanisms by which such sequences can be produced will be postponed
to the end of the section. For the moment we consider only the structure
of the sequences given:

BBBBBBBBBBBBBBBBBBBBBBBBB 
CCCCCCCCCCCCCCCCCCCCCCCCC 

BBBBBBBBBBBBBBBBBBBBBBBBB 

CCCCCCCCCCCCCCCCCCCCCCCCC 

(1)

BBBBBBBBBBBBBBBBBBBBBBBBB 

CCCCCCCCCCCCCCCCCCCCCCCCC 

BBBBBBBBBBBBBBBBBBBBBBBBB 

CCCCCCCCCCCCCCCCCCCCCCCCC 

The two sequences, consisting of 100 elements each, are to be read continuously
from left to right; the left end of a row of the B-sequence is to be connected
to the right end of the preceding row of the B-sequence. The B-sequence
contains 51 B’s, 49 $\bar{B}$’s; the C-sequence has 49 C’s, 51 $\bar{C}$’s; the probability,
therefore, is virtually equal to $\frac{1}{2}$. That the sequences are normal may be
shown by enumeration of the selection given by the first predecessor. Among
the 51 elements of the B-sequence having the predecessor B are 26 B’s, 25 $\bar{B}$’s,
so that the probability $\frac{1}{2}$ is reasonably well satisfied. Among the 48 elements
with the predecessor $\bar{B}$ are 25 B’s, 23 $\bar{B}$’s, which also corresponds well to the
probability $\frac{1}{2}$. The situation is similar for the C-sequence. Among the 49
elements with the predecessor C are 23 C’s, 26 $\bar{C}$’s. Among the 50 elements
with the predecessor $\bar{C}$ are 26 C’s, 24 $\bar{C}$’s. Thus the probability $\frac{1}{2}$ obtains
everywhere in good approximation. I do not wish to investigate the normal
character further and am satisfied with the result that the sequences at least
have the property of being normal with respect to the first predecessor.

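However, we see at a glance that the frequency of the combination B.C
is not determined by the product $\frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}$; on the contrary, the
combinations B.C as well as $\bar{B}.\bar{C}$ are more frequent than the others. We count
33 B.C, 18 $B.\bar{C}$, 16 $\bar{B}.C$, 33 $\bar{B}.\bar{C}$. In comparison with the total number
of 100 elements for each sequence, the frequency of the combinations B.C
and $\bar{B}.\bar{C}$ thus results as about $\frac{1}{3}$; the frequency of combinations $B.\bar{C}$ and
$\bar{B}.C$ as about $\frac{1}{6}$. If we introduce the corresponding probabilities within the
subsequences instead of using the probabilities of the combinations, we have
as a result

$$\begin{aligned}
P(A,B) &= \tfrac{51}{100} \sim \tfrac{1}{2} & P(A,C) &= \tfrac{49}{100} \sim \tfrac{1}{2} \\
P(A.B,B^1) &= \tfrac{26}{51} \sim \tfrac{1}{2} & P(A.\bar{B},B^1) &= \tfrac{25}{48} \sim \tfrac{1}{2} \\
P(A.C,C^1) &= \tfrac{23}{49} \sim \tfrac{1}{2} & P(A.\bar{C},C^1) &= \tfrac{26}{50} \sim \tfrac{1}{2} \\
P(A.B,C) &= \tfrac{33}{51} \sim \tfrac{2}{3} & P(A.C,B) &= \tfrac{33}{49} \sim \tfrac{2}{3} \\
P(A.B,\bar{C}) &= \tfrac{18}{51} \sim \tfrac{1}{3} & P(A.\bar{C},B) &= \tfrac{18}{51} \sim \tfrac{1}{3}
\end{aligned} \tag{2}$$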

A stands here for the first term, the reference class, which is compact for 
the two sequences and is not written down in (1). The first three lines of (2) 
correspond to the properties of normal sequences; the last two lines express 
a mutual dependence of the two sequences. Consider, for instance, a roulette 
wheel the inner part of which is covered so that the ball can come to rest
only on the outer part of the sectors. If we use two balls connected by a 
piece of string that is much shorter than the width of the sectors, the balls 
will usually stop on fields of the same color. Only occasionally will they 
come to rest so that the string crosses the boundary of two sectors. The 
resulting sequences will be of type (1). The individual sequences will be normal,
but if one ball lies on “black”, the other ball will usually come to rest
on “black” also, and the corresponding result will hold for “red”. 
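
A rough simulation can imitate sequences of type (1). The sketch below is not a model of the roulette mechanism itself: it simply assumes that the first ball shows “black” with probability 1/2 on each spin and that the second ball agrees with it with probability 2/3, the value suggested by table (2). The function names, the sample size, and the seed are hypothetical choices, and pseudo-random numbers stand in for a normal sequence.

```python
import random


def coupled_sequences(n, p_same=2 / 3, seed=0):
    """Return two lists of booleans (True = "black"): b is a fair sequence,
    c repeats the element of b with probability p_same and otherwise
    takes the opposite color."""
    rng = random.Random(seed)
    b = [rng.random() < 0.5 for _ in range(n)]
    c = [x if rng.random() < p_same else not x for x in b]
    return b, c


def frequency(flags):
    """Relative frequency of True in a nonempty list of booleans."""
    return sum(flags) / len(flags)


def frequency_after(seq, predecessor):
    """Frequency of True among elements whose predecessor equals `predecessor`."""
    return frequency([seq[i] for i in range(1, len(seq)) if seq[i - 1] == predecessor])


def frequency_given(target, given):
    """Frequency of True in `target` at the places where `given` is True."""
    return frequency([t for t, g in zip(target, given) if g])


b, c = coupled_sequences(100_000)
print(frequency(b), frequency(c))                            # each near 1/2
print(frequency_after(b, True), frequency_after(c, True))    # each near 1/2
print(frequency_given(c, b))                                 # near 2/3, not 1/2
```

Within each sequence the selection by predecessor leaves the frequency unchanged, while the selection of one sequence by the other shifts it, which is the pattern of the first three and the last two lines of (2).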

In this example the probabilities have been chosen so that the disjunctions
$B \lor \bar{B}$ and $C \lor \bar{C}$ possess equal probability for each of their terms. The
preceding considerations are, of course, independent of this special case. But the
case must be studied more closely because of its practical importance.

Consider the disjunctions $B_1 \lor \ldots \lor B_r$ and $C_1 \lor \ldots \lor C_r$, which contain the
$B_i$ and $C_k$ as equally probable cases. The fact that the special theorem of
multiplication holds for combinations of these terms leads to an analogous
equiprobability for all possible combinations. The mutual independence of
the terms of such disjunctions is identical with the equiprobability of all
combinations $B_i.C_k$. This can be derived immediately from¹

$$P(A,B_1) = \ldots = P(A,B_r) = P(A,C_1) = \ldots = P(A,C_r) \tag{3}$$

$$P(A,B_i.C_k) = P(A,B_i) \cdot P(A,C_k) \tag{4}$$


It is of interest to note that another assumption can be used for the
combination of terms, which can also be regarded as expressing an equiprobability
of combinations, though in a different sense. In (4) the probability of a
combination $B_i.C_i$ is assumed as equal to the probability of a combination
$B_i.C_k$; the probability of a disjunction $B_i.C_k \lor B_k.C_i$ is then twice as large.
If we throw with two dice, for instance, the probability of obtaining face
6 with both dice is equal to $\frac{1}{36}$; the probability of obtaining “5” with the
first die and “6” with the second is likewise equal to $\frac{1}{36}$. If we speak of the
combinations 6.6 and 5.6, respectively, without specifying on which die each
individual face shows up, then the combination 5.6 is twice as probable as
the combination 6.6; for 5.6 can result in two different ways, which may be
symbolized by $B_5.C_6 \lor B_6.C_5$, whereas 6.6 has only the form $B_6.C_6$.
Combinations not specifying to which individual die the term appertains may be
called nonindividualized combinations. When such combinations are used, no
attention is paid to individual differences between the combinations $B_i.C_k$
and $B_k.C_i$. Now it is possible to carry through the assumption that
nonindividualized combinations constitute equiprobable cases. This idea, which,
of course, leads to a different kind of statistics, deserves some further
investigation.

The assumption is satisfied by the conditions

$$\begin{aligned}
P(A,B_i) &= \frac{1}{r} & P(A,C_k) &= \frac{1}{r} \\
P(A.B_i,C_i) &= \frac{2}{r+1} & P(A.C_i,B_i) &= \frac{2}{r+1} \\
P(A.B_i,C_k) &= \frac{1}{r+1}, \quad i \neq k & P(A.C_k,B_i) &= \frac{1}{r+1}, \quad i \neq k
\end{aligned} \tag{5}$$

¹ The combinations will also be equally probable when the two disjunctions have different
numbers of terms. I shall not discuss this case here, however.
These equations formulate the equiprobability of nonindividualized
combinations, as can be seen from the relations

$$
\begin{aligned}
P(A,B_i.C_k \lor B_k.C_i) &= P(A,B_i.C_k) + P(A,B_k.C_i) \\
&= P(A,B_i) \cdot P(A.B_i,C_k) + P(A,B_k) \cdot P(A.B_k,C_i) \\
&= \frac{1}{r} \cdot \frac{1}{r+1} + \frac{1}{r} \cdot \frac{1}{r+1} = \frac{2}{r(r+1)}
\end{aligned}
\tag{6a}
$$

$$P(A,B_i.C_i) = P(A,B_i) \cdot P(A.B_i,C_i) = \frac{1}{r} \cdot \frac{2}{r+1} \tag{6b}$$

For the case of the die, we would then have the value $\frac{2}{7}$ for the
probability of obtaining a “6” on the second die when the first shows a “6”;
we would have the value $\frac{1}{7}$ for the probability of obtaining a “5”
on the second die when the first shows a “6”. The combinations $B_i.C_k$ are
here only half as probable as the combinations $B_i.C_i$, so that
equiprobability obtains only for nonindividualized combinations. We thus have
the probability $\frac{1}{21}$ for every nonindividualized combination, that
is, for a combination 6.6 as well as for 6.5.
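
The die values quoted above can be checked mechanically. The following is a minimal sketch, not part of the original text: the function names `joint_individualized` and `merge_nonindividualized` are hypothetical, and the code simply tabulates conditions (5) with exact fractions and merges each pair $B_i.C_k$, $B_k.C_i$ into one nonindividualized combination.

```python
from fractions import Fraction


def joint_individualized(r):
    """Joint probabilities P(A, B_i.C_k) built from conditions (5)."""
    p_b = Fraction(1, r)  # P(A, B_i) = 1/r
    joint = {}
    for i in range(1, r + 1):
        for k in range(1, r + 1):
            # P(A.B_i, C_k) is 2/(r+1) for i == k and 1/(r+1) otherwise.
            cond = Fraction(2, r + 1) if i == k else Fraction(1, r + 1)
            joint[(i, k)] = p_b * cond
    return joint


def merge_nonindividualized(joint):
    """Add up (i, k) and (k, i) so that the order of the two dice is ignored."""
    merged = {}
    for (i, k), prob in joint.items():
        key = (min(i, k), max(i, k))
        merged[key] = merged.get(key, 0) + prob
    return merged


joint = joint_individualized(6)
merged = merge_nonindividualized(joint)
print(len(merged), set(merged.values()))
```

For r = 6 there are 21 nonindividualized combinations, and under these assumptions each one receives the same probability, which agrees with the value $\frac{1}{21}$ stated in the text.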

It remains to demonstrate that equations (5) are free from contradictions. 
Such proof is necessary because the assumption that nonindividualized com¬ 
binations are equally probable contains overdeterminations. Now the proof 
is easily given. We see that the relations 

$$\sum_{k=1}^{r} P(A.B_i,C_k) = 1 \qquad \sum_{i=1}^{r} P(A.C_k,B_i) = 1 \tag{7}$$

follow from (5); this condition is required because of the completeness of the 
disjunction. Furthermore, we derive from (5), using the elimination theorem 
(21, § 19), 

$$
\begin{aligned}
P(A,C_k) &= \sum_{i=1}^{r} P(A,B_i) \cdot P(A.B_i,C_k) \\
&= P(A,B_1) \cdot \sum_{i=1}^{r} P(A.B_i,C_k) = P(A,B_1)
\end{aligned}
\tag{8}
$$

Here we used the relation 

$$\sum_{i=1}^{r} P(A.B_i,C_k) = 1 \tag{9}$$

which follows from (5). Although formula (9), in general, is not a necessary
condition, the probabilities $P(A.B_i,C_k)$ being nonbound probabilities, it
must be required in this case because otherwise (8) would lead to
contradictions. The result (8) shows the admissibility of the conditions given
in (5) for the $P(A,C_k)$. Finally, we derive from the rule of the product
(6, § 12) that

$$P(A.C_k,B_i) = \frac{P(A,B_i)}{P(A,C_k)} \cdot P(A.B_i,C_k) = P(A.B_i,C_k) \tag{10}$$

The last step follows because of $P(A,B_i) = P(A,C_k)$. Thus the remaining
part of equations (5) is also proved to be free from contradictions.

This concludes the proof that it is logically admissible to assume that
nonindividualized combinations are equally probable. The assumption, however,
entails a mutual dependence of the terms $B_i$ and $C_k$; for (5) contradicts
(4, § 14). This is of interest, for, in recent times, quantum theory has made
use of such an assumption in the Einstein-Bose statistics. The given analysis
shows that the equiprobability of nonindividualized combinations, assumed
in this statistics, must be interpreted as a physical dependence. It is seen,
furthermore, that it is pointless to adduce a priori reasons with the intention
of deciding whether the individualized or the nonindividualized combinations
should be regarded as equiprobable. All that can be asked is whether the
events considered are dependent or independent; and this is an empirical
question.

The relations (5) enable us to construct physical mechanisms that supply
sequences in which the nonindividualized combinations are equally probable.
For the sake of simplicity we use a disjunction of only two terms. The
conditions (5) can then be realized if we have a mechanism for the probability
$\frac{1}{2}$, another for the probability $\frac{1}{3}$, and a third for the
probability $\frac{2}{3}$. For the first mechanism a die may be chosen when we
use only the results even and odd. The other two mechanisms can be constructed
by means of colored dice without eyes and with the following properties: one
die has two black faces and four white ones; the other, four black faces and
two white ones. Call the first the white die; the other, the black die. For
the white die the probability of obtaining white equals $\frac{2}{3}$, and the
probability of black equals $\frac{1}{3}$; for the black die the situation is
the reverse.

We proceed with the game as follows. Throw first with the numbered die
and note the result as $B$ if an even number appears, as $\bar{B}$ if an odd
number appears. Then throw with one of the colored dice according to the
following rule: if the result of the numbered die was $B$, throw with the
black die; if the result of the numbered die was $\bar{B}$, use the white die.
The appearance of a black face may be designated by $C$. After each set of two
throws, consisting of one throw with the numbered die and one with a colored
die, a new set is played according to the same rules. In this manner we
coordinate to every $B$ and $\bar{B}$ of the numbered die a corresponding $C$
or $\bar{C}$. It follows at once from our knowledge of the properties of a die
that the sequence of the $B$ and $\bar{B}$ thus obtained is normal. That the
sequence of the $C$ and $\bar{C}$, obtained indirectly, is also normal,
follows from (8); each individual element of this sequence is independently
played for, with the probability $\frac{1}{2}$, when we regard as the starting
point the situation before the throwing of the numbered die. In this manner
example (1) was actually produced. It therefore illustrates the
equiprobability of nonindividualized combinations. If we count from this
viewpoint, we find in (1) 33 combinations $B.C$, 33 combinations
$\bar{B}.\bar{C}$, and 34 mixed combinations $B.\bar{C} \lor \bar{B}.C$. We see
that the equiprobability of the combinations is satisfied reasonably well.
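
The game just described is easy to imitate with a pseudorandom generator. The sketch below is only an illustration under stated assumptions: the function name `play_game`, the seed, and the labels `"B.C"`, `"notB.notC"`, and `"mixed"` are hypothetical, and a software generator stands in for the physical dice.

```python
import random
from collections import Counter


def play_game(n_sets, seed=0):
    """Play n_sets rounds: numbered die first, then the colored die it selects."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_sets):
        b = rng.randint(1, 6) % 2 == 0       # B: even number on the numbered die
        # After B the black die is thrown (black with probability 2/3),
        # after not-B the white die (black with probability 1/3).
        p_black = 2 / 3 if b else 1 / 3
        c = rng.random() < p_black           # C: a black face appears
        if b and c:
            counts["B.C"] += 1
        elif not b and not c:
            counts["notB.notC"] += 1
        else:
            counts["mixed"] += 1
    return counts


counts = play_game(60000, seed=1)
print({name: round(n / 60000, 3) for name, n in counts.items()})
```

With a long run, each of the three nonindividualized combinations should appear with a relative frequency close to $\frac{1}{3}$, in line with the counts 33, 33, and 34 reported for example (1).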

These considerations show that the order of the individual sequence is 
independent of its relations to other sequences. The frequency within each 
individual sequence of (1), as well as its internal order, is not influenced by 
the fact that homogeneous combinations have a higher probability than mixed
ones, taken individually. Dependence is expressed solely by a relation between 
the sequences, not by properties of individual sequences. The technique of the 
probability calculus, therefore, enables us to formulate dependence as a rela¬ 
tion between sequences without submitting the individual sequence to any 
special conditions in regard to structure. 

The example can be used to illustrate another feature exhibited by the
relation of the dependence of sequences. It was shown above (§ 23) that the
independence of sequences is not a transitive relation. It is now easy to show
by an example that even the normal character of sequences does not change
this result. If we add to the sequences $B$ and $C$ in (1) another die
sequence $D$, obtained by playing independently with another die, noting down
an even result as $D$ and an odd result as $\bar{D}$, then $B$ is independent
of $D$, and $D$ is independent of $C$. However, as shown in the example, $B$
is not independent of $C$.
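
The failure of transitivity can be verified by exact enumeration. This is a minimal sketch under the assumptions of the dice game above; the helper names `joint_bcd` and `independent` are hypothetical. Since each of the three sequences has only two attributes, it suffices to test the product rule for the unbarred attributes.

```python
from fractions import Fraction
from itertools import product


def joint_bcd():
    """Exact joint distribution of (B, C, D); True stands for the unbarred attribute."""
    half = Fraction(1, 2)
    dist = {}
    for b, c, d in product([True, False], repeat=3):
        # The colored die agrees with the numbered die with probability 2/3.
        p_c_given_b = Fraction(2, 3) if c == b else Fraction(1, 3)
        dist[(b, c, d)] = half * p_c_given_b * half
    return dist


def independent(dist, x, y):
    """True if P(x and y) equals P(x) * P(y) for coordinates x and y."""
    p_x = sum(p for key, p in dist.items() if key[x])
    p_y = sum(p for key, p in dist.items() if key[y])
    p_xy = sum(p for key, p in dist.items() if key[x] and key[y])
    return p_xy == p_x * p_y


dist = joint_bcd()
print(independent(dist, 0, 2), independent(dist, 2, 1), independent(dist, 0, 1))
```

Coordinates 0, 1, 2 stand for B, C, D. The first two checks succeed and the third fails, which is the pattern described in the text.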

§ 33. Probability Transfer 

We now turn to the analysis of nonnormal sequences and thus of cases where 
the probability of an element depends on its predecessors. The simplest and 
most important type is a sequence in which the immediate predecessor alone 
determines the probability of an element, whereas the other predecessors are 
irrelevant. Such sequences were first studied by Markoff and are often called 
Markoff chains. I shall use the name of sequences with probability transfer. 

These sequences may be illustrated by an example, constructed with the
help of the white and black dice described at the end of § 32. Starting with
any of the two dice, we observe the following rule: if black occurs, the next
throw is made with the black die; if white appears, the white die is used for
the following throw. This is continued; black and white are taken as $B$ and
$\bar{B}$ within one sequence. In general, when a certain result has been
obtained, the probability of the same result in the next throw equals
$\frac{2}{3}$; that of a different result is equal to $\frac{1}{3}$. The
probability transfer has here the character of a drag; there exists a tendency
to stay, and the change in events is delayed.




The opposite case is obtained when we reverse the rule: if black occurs, we 
proceed to play with the white die; if white appears, we play the next throw 
with the black die. The probability transfer possesses here the character of a 
compensation; there exists a tendency to alternate, and the change in events 
is speeded up. 

It is obvious that, on the average, the game will equal a game played with
the probability $\frac{1}{2}$, such as obtains for a die with three white and
three black faces, or for heads and tails of a coin. The order of the
sequence, however, will be different, because it is not determined by the
special theorem of multiplication, but in a more complicated manner.
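
Both rules can be imitated in a few lines. The following sketch is an illustration only: the names `transfer_sequence` and `alternations` are hypothetical, the lowercase letter `b` stands for $\bar{B}$, and a pseudorandom generator replaces the colored dice. The parameter `stay_prob` is $\frac{2}{3}$ for drag and $\frac{1}{3}$ for compensation.

```python
import random


def transfer_sequence(n, stay_prob, seed=0):
    """Sequence of 'B'/'b' in which each result repeats its predecessor
    with probability stay_prob ('b' stands for not-B)."""
    rng = random.Random(seed)
    seq = [rng.choice("Bb")]
    for _ in range(n - 1):
        prev = seq[-1]
        if rng.random() < stay_prob:
            seq.append(prev)
        else:
            seq.append("b" if prev == "B" else "B")
    return "".join(seq)


def alternations(seq):
    """Number of transitions from one result to the other."""
    return sum(1 for x, y in zip(seq, seq[1:]) if x != y)


drag = transfer_sequence(120, 2 / 3, seed=1)
compensation = transfer_sequence(120, 1 / 3, seed=1)
print(drag)
print(compensation)
print(alternations(drag), alternations(compensation))
```

In a long run the share of `B` stays near $\frac{1}{2}$ for both rules, while the number of alternations differs markedly: roughly one third of all steps for drag and two thirds for compensation.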

Two examples of sequences obtained by the procedure described are given. 
Example (1) is a sequence with drag, (2) a sequence with compensation. 

BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 
BBBBBBBBBBBBBBBBBBBBBBBBBEBBBB (1) 
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 

bbbbbbbbbbbbbbbbbbbbbbbbEbbbbb 

BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB (2) 
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 

EbbbbbbbbbbbbbbbbbbbbbbbbbbBbb 

It may be seen at a glance that probability drag, as represented by (1), is 
characterized by relatively long runs, that is, successions of equal results; 
and that probability compensation, as represented by (2), is characterized 
by comparatively frequent alternations, that is, transitions from one result 
to the other. This fact finds its expression in the following statistics. We 
have for (1) 

P(A,B) « T«r ~ J P(A,B) = rVV - \ 

P(A.B y B l ) * tf P(A.B,B l ) = ** (1 ; ) 

P{A.B,B ■) = n ~ i P(A.B,B ») = f* ~ I 

In contradistinction to these results, we obtain for (2) 

P(A,B) = rffr ~ * P(A,B) = T % ~ § 

P{A.B,B l ) = «« P(A.B,B *) = ft (2') 

P(A.B,B 0 = U P(A.B,B') = « 

We turn now to the theoretical treatment of this sequence type. The 
elimination theorem (2, § 19) gives 

$$
\begin{aligned}
P(A.A^1,B^1) = {} & P(A.A^1,B) \cdot P(A.A^1.B,B^1) \\
& + P(A.A^1,\bar{B}) \cdot P(A.A^1.\bar{B},B^1)
\end{aligned}
\tag{3}
$$

Using (6, § 28), we find

$$P(A,B) = P(A,B) \cdot P(A.B,B^1) + [1 - P(A,B)] \cdot P(A.\bar{B},B^1) \tag{4}$$

We now make use of the existence rule, which was extended in § 28 to include
phase symbols. According to this rule, (4) may be considered as determining
existence even if we do not know whether the two quantities $P(A.A^1,B^1)$ and
$P(A.A^1,B)$ in (3) exist. Now (4) contains as an unknown quantity only
$P(A,B)$, since the two probabilities $P(A.B,B^1)$ and $P(A.\bar{B},B^1)$ are
given as existing. It then follows from (4) that the mean probability
$P(A,B)$ exists.

The value of $P(A,B)$ may be obtained from (4). We use the abbreviations

$$P(A,B) = p \qquad P(A.B,B^1) = p_1 \qquad P(A.\bar{B},B^1) = q_1 \tag{5}$$

Then (4) can be written in the form

$$p = p\,p_1 + (1 - p)\,q_1 = \frac{q_1}{1 - p_1 + q_1} \tag{6}$$

Hereby $P(A,B)$ is expressed as a function of $P(A.B,B^1)$ and
$P(A.\bar{B},B^1)$. To make this relation clearer, we introduce a notation
somewhat different from (5):

$$P(A,B) = p \qquad P(A.B,B^1) = p + \epsilon \qquad P(A.\bar{B},B^1) = p - \eta \tag{7}$$

Then (6) is transformed into the relation

$$\frac{p}{1 - p} = \frac{\eta}{\epsilon} \tag{8}$$
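
Relations (6) to (8) can be checked numerically. The following is a small sketch with exact fractions; the function names `mean_probability` and `degrees` are hypothetical. It assumes the nondegenerate case in which the denominator $1 - p_1 + q_1$ does not vanish.

```python
from fractions import Fraction


def mean_probability(p1, q1):
    """Solve p = p*p1 + (1 - p)*q1 for p, as in (6)."""
    return q1 / (1 - p1 + q1)


def degrees(p1, q1):
    """Return (p, epsilon, eta) with p1 = p + epsilon and q1 = p - eta, as in (7)."""
    p = mean_probability(p1, q1)
    return p, p1 - p, p - q1


# The colored dice: p1 = 2/3, q1 = 1/3 give p = 1/2 and epsilon = eta = 1/6.
print(degrees(Fraction(2, 3), Fraction(1, 3)))

# An asymmetric case: p1 = 3/4, q1 = 1/2.
p, eps, eta = degrees(Fraction(3, 4), Fraction(1, 2))
print(p, eps, eta, eta / eps == p / (1 - p))
```

In the asymmetric case the sketch yields $p = \frac{2}{3}$, $\epsilon = \frac{1}{12}$, $\eta = \frac{1}{6}$, so that $\eta / \epsilon = 2 = p / (1 - p)$, as (8) requires.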

It can be seen at once that $\epsilon$ and $\eta$ must have the same sign. If
both are positive, there is a drag; if both are negative, there is a
compensation. In figure 7 the relation (8) is illustrated for both cases. We
recognize that $p$, in general, does not lie exactly halfway between $p_1$ and
$q_1$; for $p > 1 - p$, $p$ lies closer to $p_1$, for $p < 1 - p$, $p$ is
closer to $q_1$. Only the case $p = \frac{1}{2}$ has symmetry, as is shown by
examples (1) and (2).

Fig. 7. Graphical representation of transfer probabilities, according to (6)
and (8). Upper diagram: $\epsilon$ and $\eta$ positive. Lower diagram:
$\epsilon$ and $\eta$ negative.

We must investigate in which manner the probability depends on the second
predecessor. It was pointed out above that for the case of probability
transfer the immediate predecessor alone is relevant. The statement must now
be made more precise. Probability transfer is defined by the condition

$$P(A.B^1_{i_1} \ldots B^{\nu-1}_{i_{\nu-1}}, B^{\nu}_{i_\nu}) = P(A.B^{\nu-1}_{i_{\nu-1}}, B^{\nu}_{i_\nu}) \tag{9}$$
This relation expresses the condition that the probability of an element is
determined by the immediate predecessor alone. Since we have

(10)

it follows that, for instance,

$$P(A.B^{\nu-2}_{i_{\nu-2}}, B^{\nu}_{i_\nu}) \neq P(A,B_{i_\nu})$$

We find that an influence of the second and of further predecessors also 
exists; but this influence has its origin only in a transfer from element to element. 
For this reason the name of probability transfer has been chosen for these 
sequences. The transfer is formulated in (9) by the condition that, when 
the first predecessor is taken into consideration, the other predecessors no 
longer have any influence; they effect a deviating selection only in compari¬ 
son to the major sequence. The influence upon the element immediately fol¬ 
lowing is still to be felt for the succeeding elements. This is shown by the 
fact that in (1) the higher probability for a B as the first successor of a B 
results in a somewhat higher probability for a B as the second successor, 
though the latter probability is not so high as the former. 

Analyzing the relation formally, we have, according to the elimination 
theorem, 

$$
\begin{aligned}
P(A.B,B^2) = {} & P(A.B,B^1) \cdot P(A.B.B^1,B^2) \\
& + P(A.B,\bar{B}^1) \cdot P(A.B.\bar{B}^1,B^2)
\end{aligned}
\tag{11}
$$


Because of (9) we have the equalities

$$
\begin{aligned}
P(A.B.B^1,B^2) &= P(A.B^1,B^2) = P(A.B,B^1) \\
P(A.B.\bar{B}^1,B^2) &= P(A.\bar{B}^1,B^2) = P(A.\bar{B},B^1)
\end{aligned}
\tag{12}
$$

Writing

$$P(A.B,B^2) = p_2 \tag{13}$$

we obtain from (11), with the use of the abbreviations (7),

$$p_2 = (p + \epsilon)^2 + (1 - p - \epsilon)(p - \eta) \tag{14}$$

By the use of (8) this can be transformed into

$$p_2 = p + \epsilon_1 \qquad \epsilon_1 = \epsilon \cdot \frac{\epsilon}{1 - p} \tag{15}$$

We now introduce the condition that

$$0 < p + \epsilon < 1 \qquad 0 < p - \eta < 1 \tag{16}$$

We thus exclude the degenerate case of a probability 1. Then it can be shown
that $|\epsilon| < 1 - p$, and, consequently, $|\epsilon_1| < |\epsilon|$. For
a positive value of $\epsilon$ it follows at once from $p + \epsilon < 1$ that
$|\epsilon| < 1 - p$. For negative $\epsilon$ the inference is somewhat more
complicated: either it is the case that $p \le 1 - p$, then we have, because
of $p + \epsilon > 0$, also $p - |\epsilon| > 0$, that is, $|\epsilon| < p$
and $|\epsilon| < 1 - p$; or it is the case that $p > 1 - p$, then (8) gives
$|\epsilon| < |\eta|$, and since, for a negative $\epsilon$, $\eta$ must also
be negative, it follows from $p - \eta < 1$ that $p + |\eta| < 1$ and,
therefore, that $|\eta| < 1 - p$ and finally $|\epsilon| < 1 - p$. So this
relation is valid for any case. Combining the result with (15), we derive

$$|p_2 - p| < |p_1 - p| \tag{17}$$



This means that after the occurrence of an event B, the probability of obtain¬ 
ing another event B for the second successor lies closer to p than the prob¬ 
ability of having an event B for the first successor. 

Generalizing the result, we shall now prove that the quantities

$$P(A.B,B^{\nu}) = p_\nu \tag{18}$$

represent a sequence converging toward $p$. For this inference we use
mathematical induction. Assuming that $p_\nu$ has the form

$$p_\nu = p + \epsilon_{\nu-1} \tag{19}$$

we derive that

$$p_{\nu+1} = p + \epsilon_\nu \qquad \epsilon_\nu = \epsilon_{\nu-1} \cdot \frac{\epsilon}{1 - p} \tag{20}$$

For the proof we use the elimination theorem

$$
\begin{aligned}
P(A.B,B^{\nu+1}) = {} & P(A.B,B^{\nu}) \cdot P(A.B.B^{\nu},B^{\nu+1}) \\
& + P(A.B,\bar{B}^{\nu}) \cdot P(A.B.\bar{B}^{\nu},B^{\nu+1})
\end{aligned}
\tag{21}
$$

Because of (9), and with the abbreviations (18) and (7), the equation can be
written

$$p_{\nu+1} = p_\nu \cdot (p + \epsilon) + (1 - p_\nu) \cdot (p - \eta) \tag{21'}$$

With the help of (19) and (8) this can be transformed into (20), whereby
the latter equation is proved to be valid.

Since $p_2$ possesses the form (19), it follows that (20) is valid for every
$\nu$. Substituting successively the value of $\epsilon_\nu$, we arrive at the
result

$$p_\nu = p + \epsilon \cdot \left(\frac{\epsilon}{1 - p}\right)^{\nu-1} \tag{22}$$

Since according to (16) we have $|\epsilon| < 1 - p$, the coefficient of
$\epsilon$ converges toward 0 with increasing $\nu$, so that $p_\nu$ converges
toward $p$. For a positive $\epsilon$, that is, for drag, the sequence
converges from one side toward $p$, but for a negative $\epsilon$, that is,
for compensation, it converges alternatingly. The $p_\nu$ will then lie
alternately on both sides of $p$.
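
The induction can be imitated numerically by iterating (21') and comparing the outcome with the closed form (22). The sketch below uses exact fractions; the function names `p_nu_by_recursion` and `p_nu_closed_form` are hypothetical, and $\eta$ is computed from $\epsilon$ by means of (8).

```python
from fractions import Fraction


def p_nu_by_recursion(p, eps, nu):
    """Iterate (21'), starting from p_1 = p + eps."""
    eta = p * eps / (1 - p)                                   # relation (8)
    value = p + eps                                           # p_1
    for _ in range(nu - 1):
        value = value * (p + eps) + (1 - value) * (p - eta)   # (21')
    return value


def p_nu_closed_form(p, eps, nu):
    """Closed form (22)."""
    return p + eps * (eps / (1 - p)) ** (nu - 1)


p = Fraction(1, 2)
for eps in (Fraction(1, 6), Fraction(-1, 6)):     # drag, then compensation
    print([str(p_nu_closed_form(p, eps, nu)) for nu in range(1, 6)])
    assert all(p_nu_by_recursion(p, eps, nu) == p_nu_closed_form(p, eps, nu)
               for nu in range(1, 6))
```

For the dice of the example ($p = \frac{1}{2}$, $\epsilon = \pm\frac{1}{6}$) the values approach $\frac{1}{2}$ from above in the case of drag and from alternating sides in the case of compensation.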

By corresponding considerations we can show that the probability of
obtaining a $B$ in the $\nu$-th element after a $\bar{B}$ assumes the value

$$P(A.\bar{B},B^{\nu}) = q_\nu = p - \eta \cdot \left(\frac{\epsilon}{1 - p}\right)^{\nu-1} \tag{23}$$

which likewise converges toward $p$.

The results may be summarized as follows. The probability of finding the
attribute $B$ in the element $y_{i+\nu}$, determined with respect to an
arbitrary element $y_i$ of the sequence (which may be either $B$ or
$\bar{B}$), converges toward $p$ with increasing $\nu$. The probability
aftereffect dies down, therefore, and the influence of neighboring elements is
obliterated. Seen from a distance, any element of the sequence has the
probability $p$; only in the immediate environment can the influence of the
predecessor be felt. Checking from this point of view, we count, for
example (1), that

P(A.B,B 2 ) = U (24) 

Since in this case $\epsilon = \frac{1}{6}$, we expect, theoretically, according to (15), the value
$\frac{1}{2} + \frac{1}{18} = \frac{5}{9}$, which is in good agreement with (24). The number of elements
in the examples is, of course, not large enough to admit an objective dis¬ 
crimination between such a value and the value 

In spite of (24) and (15), it is the first predecessor only that determines
the probability. This is seen by counting in (1) all the elements that are
preceded by a group $BB$ and comparing this number with the number of
elements preceded by a group $\bar{B}B$. We find

P(A B B\B 2 ) =||-| P(A .B.B\B 2 ) = fg - § (25) 

The probabilities are equal, and are the same as those resulting for a selec¬ 
tion by the first predecessor alone, corresponding to ( 12 ). It should be em¬ 
phasized again that these numerical values are given only as an illustration; 
the number of elements in the example is too small for numerical proof. 
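
Because the printed examples are too short for a numerical test, the same counts can be repeated on a long simulated sequence. The following sketch is only an illustration: the names `drag_sequence` and `relative_frequency` are hypothetical, the coding 1 for $B$ and 0 for $\bar{B}$ is an assumption of the sketch, and a pseudorandom generator replaces the dice.

```python
import random


def drag_sequence(n, seed=0):
    """0/1 sequence (1 = B) in which each element repeats its predecessor
    with probability 2/3."""
    rng = random.Random(seed)
    seq = [rng.randint(0, 1)]
    for _ in range(n - 1):
        seq.append(seq[-1] if rng.random() < 2 / 3 else 1 - seq[-1])
    return seq


def relative_frequency(seq, condition, lag):
    """Frequency of B at position i + lag among the positions i >= 1
    for which condition(seq, i) holds."""
    hits = total = 0
    for i in range(1, len(seq) - lag):
        if condition(seq, i):
            total += 1
            hits += seq[i + lag]
    return hits / total


seq = drag_sequence(200000, seed=3)
second = relative_frequency(seq, lambda s, i: s[i] == 1, 2)
after_bb = relative_frequency(seq, lambda s, i: s[i - 1] == 1 and s[i] == 1, 1)
after_nb = relative_frequency(seq, lambda s, i: s[i - 1] == 0 and s[i] == 1, 1)
print(round(second, 3), round(after_bb, 3), round(after_nb, 3))
```

Under these assumptions the first number should lie near $\frac{5}{9}$, the value expected from (15), while the last two should both lie near $\frac{2}{3}$, which illustrates that the second predecessor adds nothing once the first is known.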

The quantity $\epsilon$ characterizes the degree of probability transfer, and
may therefore be called the degree of transfer. A small value of $\epsilon$
represents a weak probability aftereffect; for $\epsilon = 0$ the sequence
becomes free from aftereffect. In the case in which $\epsilon$ is positive and
has a large value, there exists not only a high probability for $B$ as
immediate successor of $B$, but the high degree of transfer penetrates to
further successors, so that the quantities $p_\nu$ become large, too. Because
of (8), $\eta$ likewise has a large and positive value. The maximum value in
this case is $\epsilon = 1 - p$, that is, $p_1 = 1$. For $\eta \neq p$, that
is, $q_1 \neq 0$, we then have $p = 1$ because of (6). This case may be
realized, for instance, by a sequence in which only the $y_i$ for which $i$ is
a square number belong to $\bar{B}$, all the other $y_i$ belonging to $B$.
Here we have $p_1 = 1$, $q_1 = 1$, $p = 1$. The sequence consisting of
elements $B$ alone, too, may be regarded as representing this case, but $q_1$
is indeterminate for such a sequence.

If, however, we have $q_1 = 0$ in addition to $p_1 = 1$ (that is,
$\epsilon = 1 - p$ and $\eta = p$), then $p$ is no longer determined by (6).
This is a degenerate case, which would be realized by a sequence consisting of
elements $B$ only, as well as by a sequence of elements $\bar{B}$ alone. In
this case (6) does not determine existence, and a limit $p$ need no longer
exist. An example may be constructed thus: the sequence begins with 10
elements $B$; then follow $10^2$ elements $\bar{B}$, $10^3$ elements $B$,
$10^4$ elements $\bar{B}$, and so on, in an alternating fashion. Here
$F^n(A,B)$ does not possess a limit, but fluctuates continuously between the
limits 1 and 0; $p_1$ and $q_1$ exist, however, and we have $p_1 = 1$,
$q_1 = 0$.
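
The behavior of the frequency in this block construction can be tabulated directly. The sketch below is a hypothetical helper, `frequencies_at_block_ends`, which assumes blocks of exactly $10^m$ elements and records the relative frequency of $B$ at the end of each block.

```python
from fractions import Fraction


def frequencies_at_block_ends(n_blocks):
    """Relative frequency of B at the end of each block; block m has 10**m
    elements, odd-numbered blocks consist of B, even-numbered ones of not-B."""
    count_b = total = 0
    result = []
    for m in range(1, n_blocks + 1):
        length = 10 ** m
        total += length
        if m % 2 == 1:
            count_b += length
        result.append(Fraction(count_b, total))
    return result


for m, freq in enumerate(frequencies_at_block_ends(6), start=1):
    print(m, float(freq))
```

With these block lengths the tabulated values at the block ends swing between about 0.91 and about 0.09 and never settle, so that the frequency has no limit, although every $B$ is followed by a $B$ and every $\bar{B}$ by a $\bar{B}$ except at the block boundaries.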

If $\epsilon$ is negative and has a large absolute value, we find a high
probability for a change, which, according to (22), penetrates strongly to
further successors, with a resulting high probability for $B$ as second
successor of $B$; that is, there will be a tendency toward double change. The
extreme value here is $|\epsilon| = p$, to which the value
$|\eta| = \frac{p^2}{1 - p}$ corresponds; these values can be assumed,
however, only if $p \le \frac{1}{2}$. This follows from the inequalities used
to derive (17); they lead, for $p > 1 - p$ with $p - \eta \le 1$ (not only, as
stated there, with $p - \eta < 1$), to the result $|\epsilon| < 1 - p$, which
would represent a contradiction for $|\epsilon| = p$. But for
$p \le \frac{1}{2}$ this extreme case is possible, and the probabilities
$p_\nu$ will converge also for $p < \frac{1}{2}$, according to (22). For the
special case $p = \frac{1}{2}$ the strictly alternating sequence (1, § 26)
represents a model; the probabilities $p_\nu$ assume here alternately the
values 0 and 1, corresponding to (22).

The example shows that there is a continuous transition from the unordered,
or normal, sequences to the extremely ordered ones, corresponding to the
continuous scale that is open to the degree of transfer. For the case
$p = \frac{1}{2}$, $\epsilon = 0$ represents the normal sequence; and
$\epsilon = -\frac{1}{2}$ supplies the strictly alternating sequence. The
values of $\epsilon$ between these extremes, or between 0 and $+\frac{1}{2}$,
represent intermediate types of sequences. For these reasons, a conception
that excludes the extremely ordered sequences from the concept of probability
sequence could hardly be regarded as consistent with the principles of a
scientific terminology.

A comparison of a nondegenerate sequence that possesses probability 
transfer with a normal sequence of equal probability for B reveals that the 
sole difference is in the phase probabilities, a difference that diminishes with 
increasing phase length, so that the sequence gradually approaches the type 
of a normal sequence with respect to higher phase probabilities. It is possible 
to compare sequences having probability transfer with normal sequences in 
still another sense, when their domain of invariance is investigated. It turns 
out that we can even postulate a regular domain of invariance (see § 30) for 
sequences with probability transfer. We cannot, of course, tell from the
definition of probability transfer as given, whether the sequences are
regular-invariant. As in sequences without aftereffect, this is an additional
property, which may be postulated by definition within the mathematical
calculus of probability, or which must be tested by observation for empirical
sequences. But there is no contradiction in combining the condition of a
regular domain of invariance with that of probability transfer. This condition
means, according to (1, § 30), that we have for the regular division
$S_{\lambda\kappa}$ ($1 \le \kappa \le \lambda$) the relations¹

$$P(A.S_{\lambda\kappa},B) = P(A,B) \tag{26a}$$

$$P(A.S_{\lambda\kappa}.B^1_{i_1} \ldots B^{\nu-1}_{i_{\nu-1}}, B^{\nu}_{i_\nu}) = P(A.B^1_{i_1} \ldots B^{\nu-1}_{i_{\nu-1}}, B^{\nu}_{i_\nu}) \tag{26b}$$

Since the sequence is not free from aftereffect, the phase probability with
selection $S_{\lambda\kappa}$ can be compared only with the same phase
probability without selection $S_{\lambda\kappa}$; obviously the phase
probability occurring in (26b) is not equal to $P(A,B_{i_\nu})$. This
extension is made possible by the form of the definition of the domain of
invariance given in (1, § 30).

Now it is possible to construct an important special case by assuming 
a = l,i'=X + l, and dropping the terms . . . B - Furthermore, put 
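first $B$ for $B_{i_1}$ and $B_{i_\nu}$; then $\bar{B}$ for $B_{i_1}$ and $B$
for $B_{i_\nu}$. Deducting the number 1 from the superscripts according to
(5, § 28), we obtain

$$
\begin{aligned}
P(A.S_{\lambda\kappa}.B,B^{\lambda}) &= P(A.B,B^{\lambda}) = p_\lambda \\
P(A.S_{\lambda\kappa}.\bar{B},B^{\lambda}) &= P(A.\bar{B},B^{\lambda}) = q_\lambda
\end{aligned}
\qquad 1 \le \kappa \le \lambda \tag{27}
$$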

Since $B$ and $B^{\lambda}$ belong to the same subsequence selected by
$S_{\lambda\kappa}$, (27) states that this subsequence is once more a sequence
with probability transfer, except that here $p_\lambda$ and $q_\lambda$
replace the transfer probabilities $p_1$ and $q_1$ existing for the major
sequence. Since $p_\lambda$ and $q_\lambda$ lie closer to $p$ than do $p_1$
and $q_1$, (27) expresses the fact that the subsequences arising from a
regular division are the more similar to a normal sequence, the greater
$\lambda$. Even in these subsequences, only the immediate predecessor within
the same subsequence is relevant to the aftereffect, in analogy to (9). This
follows because with (9) and (26b) we have

P(A .. Bi..B\,L O = P(A. B io .B) k ,B?J 

= P(A.B\,B?J (28) 

The latter equality is easily derived as follows. Using the decomposition (21), 
we can apply (9) and find 

P(A.Bl . . .B’i..B?XB?$ 

= P(A . B^XbI t 3 ,) = P(A .Bi.^Bl +I ) (29) 

We then extend the relation recursively, using the decomposition (21), for 
any chosen phase distance. 

Whether a given sequence with probability transfer is regular-invariant 
is a condition to be tested in every single case, as was pointed out above. 

1 We write here $B_{i_1} \ldots B_{i_\nu}$, although we deal only with an
alternative $B$ and $\bar{B}$. The subscripts $i_1 \ldots i_\nu$ admit, then,
only of the two values 1, 2: $B_1$ signifies $B$; $B_2$ stands for $\bar{B}$.

This additional specification represents a part of our knowledge about the
sequence under consideration, a knowledge that is not included in the
statement of probability transfer alone. Thus the sequences with probability
transfer, (1) and (2), produced by casting dice according to a certain rule,
are regular-invariant.
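
Relations (26a) and (27) can be illustrated on a simulated dice sequence with drag. The sketch below rests on several assumptions: the names `drag_sequence`, `regular_division`, and `transfer_frequencies` are hypothetical, 1 codes $B$ and 0 codes $\bar{B}$, and the regular division $S_{\lambda\kappa}$ is taken to select every $\lambda$-th element beginning with element number $\kappa$.

```python
import random


def drag_sequence(n, seed=0):
    """0/1 sequence (1 = B) in which each element repeats its predecessor
    with probability 2/3."""
    rng = random.Random(seed)
    seq = [rng.randint(0, 1)]
    for _ in range(n - 1):
        seq.append(seq[-1] if rng.random() < 2 / 3 else 1 - seq[-1])
    return seq


def regular_division(seq, lam, kappa):
    """Select every lam-th element, starting with element number kappa (1-based)."""
    return seq[kappa - 1::lam]


def transfer_frequencies(seq):
    """Frequencies of B after B and of B after not-B within seq."""
    after = {0: [0, 0], 1: [0, 0]}   # predecessor -> [B successors, all successors]
    for prev, cur in zip(seq, seq[1:]):
        after[prev][0] += cur
        after[prev][1] += 1
    return after[1][0] / after[1][1], after[0][0] / after[0][1]


seq = drag_sequence(300000, seed=4)
sub = regular_division(seq, 3, 1)
print(round(sum(sub) / len(sub), 3), [round(x, 3) for x in transfer_frequencies(sub)])
```

For $\lambda = 3$ formula (22) gives $p_3 = \frac{1}{2} + \frac{1}{6} \cdot \frac{1}{9} = \frac{14}{27}$, and (23) gives $q_3 = \frac{13}{27}$; the simulated subsequence should reproduce these values approximately while keeping the frequency of $B$ near $\frac{1}{2}$.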

An example of a sequence with transfer that is not regular-invariant can 
be obtained as follows. At a seaside town we record continuously the height 
of the tides; these numbers possess probability transfer, because a very high 
tide is always followed by a reasonably high tide. If we now carry out a
regular division, the sequence remains invariant, in general. An exception,
however, is provided by the division $\lambda = 14$. This value of $\lambda$
corresponds to the 14-day half-period of the moon, which results in
particularly high tides for the full moon and the new moon, so that we have
$P(A.S_{\lambda\kappa},B) \neq P(A,B)$.

The type of nonnormal sequence produced by probability transfer, according
to (9), is not only important for purely statistical applications but has
great significance for physics in general. For it enables us to characterize
the case of causal connection as it occurs in causal chains. Physical action
is always action by contact. When causal connection assumes the form of a
probability connection, we shall find sequences in which the immediately
preceding event alone determines the probability of the succeeding event. We
can therefore interpret (9) as the mathematical expression of the physical
phenomenon of action by contact. In particular, it is the case of a positive
sign for $\epsilon$ and $\eta$, that is, probability drag, which occurs in
nature.

To give an idealized example, observe the path of a gas molecule within
a closed vessel, dividing the vessel into $r$ cells corresponding to the
disjunction $B_1 \lor \ldots \lor B_r$. Note regularly at short intervals
$\Delta t$ in which cell the molecule happens to be. Because of collisions,
the molecule will follow a zigzag path with varying velocity. If the molecule
is at the time $\Delta t_i$ in the cell $B_k$, it is most likely to be found
in the neighboring cells at the time $\Delta t_{i+1}$. Over a longer period,
however, this probability will die down; eventually, every cell will possess
equal probability, so that the influence of the initial position is
obliterated.
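
The dying down of the initial position can be made visible with a toy model. The following sketch is not a physical model of a gas; it assumes, purely for illustration, that the $r$ cells are arranged in a ring and that in each interval the molecule stays where it is with probability $\frac{1}{2}$ or moves to either neighboring cell with probability $\frac{1}{4}$. The function names `step` and `distribution_after` are hypothetical.

```python
def step(dist, stay=0.5):
    """One interval for a molecule on a ring of cells: it stays with probability
    `stay`, otherwise it moves to the left or right neighbor with equal probability."""
    r = len(dist)
    move = (1 - stay) / 2
    return [stay * dist[k] + move * dist[(k - 1) % r] + move * dist[(k + 1) % r]
            for k in range(r)]


def distribution_after(start_cell, r, n_steps):
    """Cell probabilities after n_steps intervals, starting in start_cell."""
    dist = [0.0] * r
    dist[start_cell] = 1.0
    for _ in range(n_steps):
        dist = step(dist)
    return dist


for n in (1, 5, 50, 500):
    print(n, [round(x, 3) for x in distribution_after(3, 8, n)])
```

After one interval the probability is concentrated on the starting cell and its neighbors; after many intervals every cell has a probability close to $\frac{1}{r}$, whatever the starting cell was.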

§ 34. The Probability Lattice 

It is a consequence of the definition of probability given in § 9 that the 
degree of probability is considered as a property of a sequence in its entirety. 
In many applications, however, we deal with sequences for which we want to 
express the fact that a definite probability exists for each individual element, 
that is, the probability is constant from element to element. In order that 
such a statement will not contradict the general logical structure of the prob¬ 
ability concept, we must investigate how the statement about an individual 
element of a probability sequence can be translated into a statement about 
a whole sequence, that is, in the language of the frequency interpretation,
how it can be expressed as a statistical statement.

The definition of the normal sequence as well as of probability transfer 
represents a step in this direction. For instance, we say that in a normal 
sequence the probability of an element does not depend upon its predecessor; 
this is formulated so that we speak about the probability within a new se¬ 
quence, namely, in a subsequence. In the case of probability transfer, we 
likewise translate the statement that the probability of an element does 
depend on its predecessor into a statement about a new sequence. The pro¬ 
cedure indicates, in principle, how to transform statements about the prob¬ 
ability of individual elements into statements about the properties of an 
entire sequence. 

The methods given so far, however, are still insufficient. The reason is 
that a selection collects an infinite number of elements of the original sequence 
into a new sequence; therefore it supplies a statement, not about an indi¬ 
vidual element, but about a subsequence. The formulations so far given thus 
do not permit us to express certain assertions that are actually made in prac¬ 
tical statistics. When we produce a probability sequence by throwing a die, 
we demand that each throw be played with the same probability, that is, 
there should not be occasional exceptions where a loaded die is used or the 
die is insufficiently shaken. The requirement signifies a well-defined condition 
for the physical production of the sequence, but the condition is not incor¬ 
porated in the foregoing definition of a normal sequence. For example, if we 
do not make the fourth throw properly, but produce it by deliberately placing 
face 6 up, the incorrectness will not show in the limit of the frequency, because 
a single throw does not alter the limit properties of the whole sequence. The 
definition of the normal sequence would exclude only the possibility of pro¬ 
ducing every fourth throw artificially. 

We can supply even farther-reaching examples in which all the elements 
of a sequence are played with different probabilities, whereas we are not 
able to detect this mode of playing when we use only the means of structural 
characterization so far developed. For instance, a roulette wheel can be
constructed with a variable width of its sectors, so that a suitable
adjustment produces any chosen degree of probability. When we now play, with
the probabilities $p_i$ changing from one throw to the next so that they
converge to a limit $p$, the sequence will possess all the properties of a
normal sequence.
It follows that by means of the methods so far used we should not be able to 
discover from the observational results that the sequence was nonnormal; we 
should call it a normal sequence. 1 

1 This statement holds not only for my definition but also for all other definitions discussed 
above, including the definitions of logical randomness given by von Mises and Church. 
See also p. 280. 
In order to develop methods that characterize a sequence from this new
point of view, we must analyze what we mean by saying that the probability
varies with the width of the roulette sectors. If we keep to a sector width
once adjusted and continue playing, the result for the sequence will be a
certain limit that differs from the limit of the sequence mentioned first. The
idea can be formulated statistically when we refer the probability statement,
not to a single sequence, but to an infinite class of sequences, which may be
written

$$
\begin{matrix}
y_{11} & y_{12} & y_{13} & y_{14} & \ldots & y_{1i} & \ldots \\
y_{21} & y_{22} & y_{23} & y_{24} & \ldots & y_{2i} & \ldots \\
\ldots & & & & & & \\
y_{k1} & y_{k2} & \ldots & & & y_{ki} & \ldots \\
\ldots & & & & & &
\end{matrix}
$$


Such an arrangement may be called a sequence lattice, or a probability
lattice. Assume that the horizontal rows represent sequences of equal
probabilities $p$ for $B$ and are normal sequences in the wider sense (as we shall now
call the normal sequences defined above). It does not follow from this
definition, however, that the vertical sequences also possess the limit $p$ of the
frequency or that they are normal; this could not be derived even if the mutual
independence of the horizontal sequences were assumed. Imagine, for
instance, that all the horizontal sequences contain, in the fourth throw, one
result obtained by placing face 6 up; this would show in the lattice by the
fact that the fourth vertical sequence possesses for the “6” the frequency 1
as its limit. Or we could produce all horizontal sequences by playing with the
adjustable roulette wheel according to the procedure mentioned; the
probabilities $p_i$ would then change from element to element and approach a
limit $p$. This procedure would show up in the limits of the vertical sequences,
since these limits would be equal to $p_i$.
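
The contrast between horizontal and vertical enumeration can be sketched in a small simulation. This is only an illustration under assumed conditions: every row is played with the same hypothetical rule p_i = p + 0.4/i for its i-th element, and all function names are invented for the example.

```python
import random


def converging_probability(i, p=0.5):
    """Assumed adjustment rule: probability of B for the i-th element (i >= 1)."""
    return p + 0.4 / i


def build_lattice(rows, cols, seed=0):
    """Build a finite section of a lattice; element (k, i) is 1 for B, else 0."""
    rng = random.Random(seed)
    return [
        [1 if rng.random() < converging_probability(i) else 0
         for i in range(1, cols + 1)]
        for _ in range(rows)
    ]


def row_frequency(lattice, k):
    """Frequency of B in the k-th horizontal sequence (0-based index)."""
    return sum(lattice[k]) / len(lattice[k])


def column_frequency(lattice, i):
    """Frequency of B in the i-th vertical sequence (0-based index)."""
    return sum(row[i] for row in lattice) / len(lattice)


lattice = build_lattice(rows=1000, cols=1000)
print(row_frequency(lattice, 0))     # near the limit p = 0.5
print(column_frequency(lattice, 0))  # near p_1 = 0.9
print(column_frequency(lattice, 3))  # near p_4 = 0.6
```

In this sketch the horizontal frequencies stay near p, while the vertical frequencies reproduce the individual p_i; only the vertical enumeration reveals the varying adjustment.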

Reversing the inference, we can express the assumption that the
probabilities within the horizontal sequences are the same from element to
element by postulating that each vertical sequence should likewise possess the
limit p for its frequency. This presents a definition of normal sequence that is 
narrower than the one previously given. We shall speak, therefore, of normal 
sequences in the narrower sense. The definition makes use of a new means of 
structural characterization: the probability lattice. 

The narrower definition of normal sequences differs essentially from the 
previous one. The first definition was concerned with an individual sequence, 
and reference was made to the properties of this sequence alone. The second
definition concerns an infinite sequence of sequences, or a class of sequences, 
and thus specifies not a property of the single sequence but, rather, a property 
of the totality of sequences. Whether an individual sequence conforms to this 
definition depends not only on the sequence itself but also on all the other 
sequences belonging to the class under consideration. 2 This peculiarity of
the definition, however, has a great advantage from the logical point of view. 
In many applications we deal with sequences of sequences, which are defined 
by a single physical rule. For instance, the rule about throwing dice, in a gen¬ 
eral formulation, defines not only a single sequence but a class of sequences, 
the properties of which are characterized by the use of the lattice. To say 
that the probability of the horizontal sequences is constant from element to 
element has a meaning because every individual element is imbedded in 
another sequence, the vertical sequence; and again we have expressed the indi¬ 
vidual property by means of a class property. 

But this procedure also is limited. If in the lattice (1) the individual element 
$y_{23}$ were played with a deviating probability, the fact would not be
recognizable in the sequence lattice. This consequence does not indicate a mistake
in the conception of probability; it merely proves that we can speak of a
specific probability of an element only when the element is imbedded in a
specific sequence. When, in the example of the adjustable roulette wheel, we
speak of a varying probability in the horizontal sequence, the statement has
meaning only because the varying probability can be represented once more
as a property of sequences. Every adjustment of the roulette wheel
determines a sequence, namely, the sequence of throws obtainable by the
adjustment. And only because of the interpretation in terms of a sequence can we
claim meaning for statements correlating a probability to every adjustment 
of the wheel. 

The lattice makes it possible to express such statements formally. Even 
the assertion that, for instance, the individual term $y_{23}$ is played with a
different probability can be formulated in a similar manner, when we use a lattice
of three dimensions, in which the sequence correlated to $y_{23}$ in the third
dimension indicates a different probability. It must be possible, in principle,
to translate statements about the variability or the constancy of certain
probabilities produced by physical devices into statements concerning
sequences of an n-dimensional lattice, if the statements are to have any
meaning. The lattice is a conceptual means of characterizing certain important
properties of sequences that cannot be formulated by the use of selections
from the sequence under consideration. Hence we shall turn now to the
definition of various types of sequences in terms of a lattice.

Let us first give a precise definition of normal sequences in the narrower 
sense. Apart from the condition concerning the limits in the vertical direction, 
we wish to incorporate in this concept also the condition of the mutual inde¬ 
pendence of the sequences. It was explained in § 23 that the independence 

2 This type of definition, which is suitably called a relative definition, is well known in the
theory of implicit definitions. For instance, it is used in the definition of geometrical elements
such as point, line, and plane. See H. Reichenbach, Philosophie der Raum-Zeit-Lehre (Berlin
and Leipzig, 1928), p. 118. 
of every two sequences does not suffice to insure the independence of three
and more sequences; the independence relation is not combinable. Conse¬ 
quently, we should demand the mutual independence of all sequences; but 
such a notion would lead to difficulties for infinitely many sequences, since 
it would refer to probability expressions containing infinitely many class 
symbols in the first term. We can avoid this consequence by introducing a 
corresponding condition in a finitized form: we lay down the condition that 
for every combination of any v sequences there exists complete independence 
in the sense of § 23. Sequences that satisfy this condition may be called 
independent by combination. We now set up the definition: 

A sequence of sequences constitutes a system of normal sequences in the
narrower sense if all the horizontal and vertical sequences are normal in the wider
sense and possess the same probability p, and if the horizontal as well as the 
vertical sequences are independent by combination. 

An extension of the abbreviated notation will be used to represent the
definition within the calculus. Instead of the expression

\[ (i)\,\bigl(x_{ki} \in A \;\underset{p}{\ni}\; y_{ki} \in B\bigr) \tag{2} \]

we write, as an abbreviation,

\[ \bigl(A \;\underset{p}{\ni}\; B^{ki}\bigr)^{i} \tag{3} \]

or, in the P-notation,

\[ P(A, B^{ki})^{i} = p \tag{3'} \]

We use here the superscripts of class symbols, as in $B^{ki}$, for the expression of
the subscripts that are added, in the detailed notation, to the symbols $x$, $y$
of the elements. The phases that were formerly expressed by the
superscripts can now be added by the use of additive terms, for instance, in the
form

\[ P(A . B^{ki}, B^{k,i+\nu})^{i} \tag{4} \]

The superscript of A may be dropped on account of (4, § 28), since the lattice 
A is assumed to be compact. The general rule of translation (p. 49) may then 
be supplemented as follows: 

The superscripts of B stand, in general, for subscripts belonging in the detailed
notation to the corresponding y; one superscript signifies only the phase of the
subscript in y; two superscripts represent the subscripts of y themselves. The
superscript added to the parentheses of the P-symbol indicates the variable bound
by the all-operator, that is, the running superscript.

Besides bound, or running, superscripts, free superscripts, such as the super¬ 
script k in (3) and (3'), will occur. Such a superscript constitutes a free vari¬ 
able. When a formula contains a free variable, this means that the formula 
is valid for all values of the variable; thus the relation (3') is meant to hold 
for all values k. In this manner we can express in the P-notation that a 
formula is generally valid, without being compelled to use an all-operator. 
The new symbolism will now be used for some necessary definitions:

A lattice is homogeneous if all the horizontal and vertical sequences possess 
equal probabilities, that is, if 

\[ P(A, B^{ki})^{i} = P(A, B^{ki})^{k} = p \tag{5} \]


Furthermore, we can express in the calculus the concept of being
independent by combination. It seems advisable to write this expression for a
many-term disjunction $B_1 \lor \ldots \lor B_r$, which, as usual, may be complete and
exclusive. Then the phrase “independent by combination” is defined by the 
following relations, which are supposed to hold for any chosen values 


P(A .fit?" '■ . . B i+r ' ’’ \ B k ^y = P(A,B “)*' (6a) 

P(A . . B tB k X) k = P(A ,B U) * (66) 


The property is formulated for horizontal sequences in (6a), for vertical
sequences in (6b). When a lattice of normal sequences in the wider sense
satisfies the conditions (5) and (6), that is, when in a homogeneous
lattice all the horizontal and vertical sequences are normal in the wider sense
and independent by combination, the lattice constitutes a system of normal
sequences in the narrower sense. 

The homogeneous lattice, then, represents a wider concept than the normal 
sequences in the narrower sense; it defines a more general type of sequence. 
A still more general sequence type is obtained by the condition that, whereas 
the horizontal sequences all have the same probability p, the vertical sequences 
possess the probabilities $p_i$ converging toward the limit $p$. The system may
then be said to form a convergent lattice. 

The behavior of sequences with respect to regular divisions was
characterized above by the use of the concept regular-invariant. In a similar manner
the concept lattice-invariant, which concerns the behavior of sequences with
respect to lattice enumeration, can be defined as the property that every
phase probability of each individual sequence has the same value as the
corresponding probability in lattice enumeration. 3 When, once more, a complete
and exclusive disjunction $B_1 \lor \ldots \lor B_r$ is used and the phases are added,
according to (4), the definition of lattice-invariance may be written in the form


\[
P(A . B^{ki}_{m_1} \ldots B^{k,i+r-1}_{m_r}, B^{k,i+\nu}_{m})^{k}
= P(A . B^{ki}_{m_1} \ldots B^{k,i+r-1}_{m_r}, B^{k,i+\nu}_{m})^{i},
\qquad \nu > 0
\tag{7}
\]


The normal sequences in the narrower sense are lattice-invariant, since all 
the probabilities (7) are equal to $P(A, B_m)$, according to (5), (6), and (3, § 29).

3 I do not include in the concept “lattice-invariant” the requirement that also the major
probabilities in the horizontal and the vertical direction should possess the same value, as 
would correspond to the homogeneous lattice; a nonhomogeneous lattice may equally be 
lattice-invariant. 
But sequences with probability transfer can also be lattice-invariant, just as
these sequences can be regular-invariant. Consider a lattice in which all
horizontal sequences possess for the alternative $B$ or $\bar{B}$ the same probability
transfer, statable in terms of $p$, $\varepsilon$, $\eta$. It then follows from the property of the
lattice invariance (7) that, using (18 and 23, § 33), we have

\[
\begin{aligned}
P(A . B^{ki}, B^{k,i+\nu})^{k} &= P(A . B^{ki}, B^{k,i+\nu})^{i} = P(A . B, B^{\nu}) = p_{\nu} \\
P(A . \bar{B}^{ki}, B^{k,i+\nu})^{k} &= P(A . \bar{B}^{ki}, B^{k,i+\nu})^{i} = P(A . \bar{B}, B^{\nu}) = q_{\nu}
\end{aligned}
\tag{8}
\]

By the help of these relations a particularly simple case can be constructed,
when all horizontal sequences are assumed to begin with the term $B$. Because
of (8) we then have

\[ P(A, B^{ki})^{k} = P(A . B^{k1}, B^{ki})^{k} = P(A . B, B^{i-1}) = p_{i-1} \tag{9} \]

The sequences form a convergent lattice the vertical sequences of which
represent the phase probabilities $p_\nu$, with $\nu = i - 1$. Such a lattice may be
called a lattice of mixture.

We cannot prove that every lattice of sequences with probability transfer 
has the property of lattice invariance formulated by (8). We must rather 
make this requirement the definition of a lattice type. The derivation in § 33 
shows only that the values $p_\nu$ hold for an enumeration in the horizontal
direction; it does not concern an enumeration in the vertical direction. When the
assumption is extended to vertical sequences, this means, physically speaking,
that every individual element of the sequences is played, respectively, with
the probability $p + \varepsilon$ or $p - \eta$ pertaining to it. This property, in fact, is
found in many applications. When we construct a lattice by playing with
dice according to the rules for probability transfer given in § 33, the vertical
sequences fulfill equation (9) for $p_\nu$ with $\nu = i - 1$. We see that, as before,
certain properties of the physical setup can be formulated only by the use 
of a lattice; they cannot be expressed for a single sequence. 

Examples of the lattice of mixture are physical processes that represent
a mixing of substances. Imagine two liquids in a vessel that is divided by a
wall into two compartments, with one liquid in each half. The first half of
the vessel may be designated by $B$, the other by $\bar{B}$. We then withdraw the
partition wall and permit the liquids to mix. Imagine that during the process
we can note regularly, at frequent intervals, for each molecule of the first
liquid whether it happens to be in $B$ or $\bar{B}$. We then obtain for each molecule
a sequence with the probability $\frac{1}{2}$ for $B$; but it has probability transfer,
because a molecule that happens to be in $B$ will, soon afterward, still be found
in $B$ (similar to the example mentioned at the end of § 33). We now write the
sequences for the molecules of the first liquid one under another,
disregarding the molecules of the second liquid. The sequences will represent a lattice
of mixture in which the first element of every horizontal sequence is given
by $B$. In the vertical sequences next to the first one (which, incidentally, do
not have probability transfer in this case), the element $B$ will still be prevalent
corresponding to the probabilities $p_\nu$. As we go to the right, the limits of the
frequency of $B$ in the vertical sequences will gradually approach the value $\frac{1}{2}$;
that is, the liquid will spread over the whole space of the vessel. In this manner
is expressed the diminishing influence of the initial position, which is precisely
what characterizes the peculiarity of all mixing processes. Without the use
of a probability lattice the mixing process cannot be formulated in the theory
of probability.
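
A lattice of mixture can be imitated by a small simulation. The sketch below rests on assumptions that are not taken from the text: each molecule is modeled as a two-state chain in which B follows B with probability p + ε and follows non-B with probability p − η, with p = 1/2 and ε = η = 0.4 chosen arbitrarily; the function names are hypothetical.

```python
import random


def mixture_lattice(rows, cols, p=0.5, eps=0.4, seed=0):
    """Each row is one molecule's sequence (1 = found in B, 0 = found in non-B).

    Assumed model: after B the probability of B is p + eps, after non-B it
    is p - eta, where eta = p * eps / (1 - p) keeps p as the overall limit.
    Every row begins with B, as in the lattice of mixture.
    """
    eta = p * eps / (1 - p)
    rng = random.Random(seed)
    lattice = []
    for _ in range(rows):
        row = [1]  # the molecule starts in compartment B
        for _ in range(cols - 1):
            prob_b = p + eps if row[-1] == 1 else p - eta
            row.append(1 if rng.random() < prob_b else 0)
        lattice.append(row)
    return lattice


def column_frequencies(lattice):
    """Frequency of B in every vertical sequence, from left to right."""
    rows = len(lattice)
    return [sum(column) / rows for column in zip(*lattice)]


frequencies = column_frequencies(mixture_lattice(rows=4000, cols=30))
print(frequencies[:4])   # starts at 1.0 and decreases
print(frequencies[-1])   # close to 0.5
```

Under these assumptions the vertical frequencies begin at 1 and drift toward 1/2, which is the diminishing influence of the initial position described above; that a real liquid behaves in this way remains an empirical matter.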

It should be emphasized that the fact of mixing cannot be derived from 
the general principles of the probability theory. 4 This is as impossible as to 
derive, from these general principles, that a sequence of throws of a die is 
a normal sequence. Like the latter case, the phenomenon of mixture must be 
regarded, in the general theory of probability, as a special case, the occurrence 
of which can be asserted only on empirical grounds. The physical phenomenon 
of mixture is not a consequence of the general laws of probability. That 
natural processes will represent a process of mixture can be mathematically 
derived only when we know that the processes constitute sequences with
probability transfer that are lattice-invariant. Then, however, the result is a
tautological statement. The theory of probability cannot prove that the process 
of mixture will occur; it can only supply the logical schema by which the 
process is to be interpreted. This is done by the use of the lattice of mixture, 
which formulates the conditions from which the occurrence of the process is 
derivable. 

The kinetic theory of gases includes the inference from the time totality to 
the space totality. 5 It is based on the assumption that the time sequence given 
by the states of one molecule (or of one system) exhibits the same statistical 
relations as the space sequence given by the states of different molecules 
(or systems) at the same time. This inference, too, is a lattice inference; it 
must be interpreted as the assumption that the sequences form a homogeneous 
lattice. The theory of probability cannot decide whether this assumption is 
valid. Such an assumption, rather, represents a physical hypothesis the validity 
of which must be ascertained in the same manner as for any other hypothesis. 
The hypothesis is tested by the observational examination of its consequences, 
applying the usual inference discussed in §§21 and 85. But to suppose a 
hidden mathematical secret in the inference from the time totality to the 
space totality is to misunderstand the theory of probability. Like all other 
forms of deductive inference, the calculus of probability cannot bring to light 
more results than were invested in the premises by suitable assumptions. 

4 The opinion has often been expressed that Bernoulli’s theorem leads to a proof of the 
mixing process. This, however, is not correct. See p. 280. 

5 See P. Hertz in Weber-Gans, Repertorium der Physik (Leipzig, 1916), Vol. I, Part 2,
Sec. 242. 
§ 35. The Average of a Sequence of Quantities

Leaving the investigations concerning the theory of order, we turn to the 
treatment of concepts that were developed for the application of the theory 
of probability to statistical problems. Many such problems involve classes 
$B_m$ to which numerical amounts $u_m$ are coordinated (amounts of money,
lengths of distances measured, etc.). For the treatment of such classes the
concepts of average and dispersion were introduced. We begin with the
concept of average. For its definition, we first define the analogous concept with
respect to a sequence of a finite number of quantities, which is called the 
mean value; the average is then the extension of this concept to a sequence 
of infinitely many quantities. 

The amount $u_m$ coordinated to the class $B_m$ represents one of the possible
values of the amounts occurring. From it we distinguish the individual value $u^i$
that is found combined with the element $y_i$ of the probability sequence.
This distinction may be indicated by the use of subscripts and superscripts.
If $y_i$ belongs to the class $B_m$, we have $u^i = u_m$. The subscript of $u$ runs through
the $r$ values of the disjunction $B_1 \lor \ldots \lor B_r$; the superscript of $u$ assumes
all numerical values from 1 to $\infty$. It will be obvious from the context that the
superscripts do not express arithmetical powers.

In the simpler case in which the sequence $A$ is compact, that is, where all
$x_i$ belong to $A$, all $u^i$ fall into the sequence considered. If we cut off this
sequence at an element $u^n$, the mean value of this finite sequence is defined by

\[ M^n(u^i)^i = \frac{1}{n}\,[u^1 + \ldots + u^n] = \frac{1}{n} \sum_{i=1}^{n} u^i \tag{1} \]

The repetition of the superscript i outside the parentheses in the term on the 
left side is to indicate its character as a superscript of summation. With 
increasing number $n$ we can coordinate to every element $u^n$ the mean value
$M^n(u^i)^i$ taken at this place; and the question arises whether the quantities
$M^n(u^i)^i$ approach a limit. If this is the case, we define as the average of the
quantities $u$ the value

\[ M(u^i)^i = \lim_{n \to \infty} M^n(u^i)^i = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} u^i \tag{2} \]

The average is thus the limit of the mean values. We call (1) and (2) the 
statistical definition, respectively, of the mean value and of the average; this
term expresses the fact that the definition characterizes these quantities
through direct enumeration. 

Consider, for example, the sequence produced by the throwing of a die.
The numbers 1 to 6 written on the faces of the die supply the values $u_m$;
$u^i$ is the number of the face of the die that lies up in the $i$-th throw. We
determine $M^n(u^i)^i$ by adding all the numbers that have appeared up to the
$n$-th throw and dividing the sum by $n$. The average of this value, that is, its
limiting value for $n \to \infty$, is $3\frac{1}{2}$.
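
The statistical definition can be followed step by step in a short computation. The sketch below simulates throws of a die with a seeded pseudo-random generator, which is an assumption standing in for actual throws, and evaluates the mean value at several places n; the names are illustrative only.

```python
import random


def mean_value(values, n):
    """Mean value of the first n quantities: (u^1 + ... + u^n) / n."""
    return sum(values[:n]) / n


rng = random.Random(0)
throws = [rng.randint(1, 6) for _ in range(100_000)]  # simulated faces u^i

for n in (10, 1_000, 100_000):
    print(n, mean_value(throws, n))  # the later values settle near 3.5
```

A finite run can only suggest the limit; the average itself is defined by the behavior of the mean values as n increases without bound.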

We can extend the definitions (1) and (2) to the case in which the sequence
$A$ is not compact and for which we do not take into account all the elements
but only those that are coordinated to the elements $y_i$ that correspond
to an $x_i$ belonging to $A$. To express this definition in the symbolism we use
the symbol $V(a)$, introduced in § 4, meaning, “truth value of the statement
$a$”. However, we deviate from our previous notation by assigning to $V(a)$
the values 1 or 0 according as $a$ is true or false. With the help of this
convention we can replace the symbol $N^n$, which was introduced in § 10, by a
summation:

\[ N^{n}_{i=1}(x_i \in A) = \sum_{i=1}^{n} V(x_i \in A) \tag{3a} \]

Abbreviating the V-symbol in the form

\[ V(A^i) =_{Df} V(x_i \in A) \qquad V(A^i . B^i) =_{Df} V[(x_i \in A) . (y_i \in B)] \tag{3b} \]

we can write for (3a), using the abbreviation (2, § 16) for the N-symbol,

\[ N^n(A) = \sum_{i=1}^{n} V(A^i) \tag{3c} \]

The frequency interpretation of probability is then written 

\[
\begin{aligned}
F^n(A,B) &= \frac{\sum_{i=1}^{n} V(A^i . B^i)}{\sum_{i=1}^{n} V(A^i)} \\
P(A,B) &= \lim_{n \to \infty} F^n(A,B)
\end{aligned}
\tag{4}
\]

The statistical definition of the mean value and of the average can now be 
symbolized by

\[
\begin{aligned}
M^n(A; u^i)^i &= \frac{\sum_{i=1}^{n} u^i \cdot V(A^i)}{\sum_{i=1}^{n} V(A^i)} \\
M(A; u^i)^i &= \lim_{n \to \infty} M^n(A; u^i)^i
\end{aligned}
\tag{5}
\]
The symbols written on the left side represent generalizations of the symbols
occurring in (1) and (2); their definitions are identical with (1) and (2) if the 
sequence A is compact. 
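
The role of the truth values in the generalized definition can be made concrete in a few lines. The following sketch is an illustration only; the function names and the sample data are hypothetical. It weights each quantity by the truth value of the statement that the corresponding element belongs to A, and divides by the number of such elements.

```python
def truth_value(statement):
    """V(a): 1 if the statement a is true, 0 if it is false."""
    return 1 if statement else 0


def mean_value_in_a(belongs_to_a, quantities):
    """Mean value of the quantities u^i, counting only elements with x_i in A."""
    numerator = sum(u * truth_value(a) for a, u in zip(belongs_to_a, quantities))
    denominator = sum(truth_value(a) for a in belongs_to_a)
    return numerator / denominator


# Non-compact case: the second element does not belong to A and is ignored.
print(mean_value_in_a([True, False, True, True], [2, 100, 4, 6]))  # 4.0

# Compact case: every element belongs to A, so the ordinary mean results.
print(mean_value_in_a([True, True, True], [1, 2, 6]))  # 3.0
```

The sketch assumes that at least one element belongs to A; otherwise the denominator vanishes and the mean value is undefined.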

This presentation of the concepts of mean value and of average makes it
clear that the sequences of quantities $u^i$ involve a certain generalization of
the concept of probability sequence. Whereas in a probability sequence each
element is referred to a class $B_m$ and thus possesses a qualitative property,
the sequences introduced in this section are composed of elements $y_i$ that
possess quantitative properties $u_m$. It is often convenient to forget about the
sequence of the events $y_i$ and to deal directly with the sequence of the
numbers $u^i$. This number sequence has the character of a probability sequence
if $B_m$ is regarded, not as a class of elements $y_i$, but of elements selected by
the condition $u^i = u_m$. Instead of the enumeration of elements, the addition
of quantities can then be employed. The enumeration may be regarded as
a special case of addition, for which the $u^i$ are represented, in particular, by
the truth values 1 and 0. For this particular case the concept of average is
identical with the concept of probability, since if in (5) $u^i$ is replaced by
$V(B^i)$, formula (4) obtains.

Conversely, replacing addition by enumeration, we can reduce the concept
of average to the concept of probability. For this purpose we use the
classification of amounts as given by $u_m$, counting how often the amount $u_m$ occurs
and multiplying $u_m$ by this number. The addition of these values gives the
same result as the direct addition of the individual values $u^i$. Employing
immediately the more general definition (5), we are able to formulate this idea
within the calculus when we add the always-true disjunction $B_1^i \lor \ldots \lor B_r^i$ and
carry out the multiplication

\[
\begin{aligned}
M^n(A; u^i)^i
&= \frac{\sum_{i=1}^{n} u^i \cdot V(A^i . [B_1^i \lor \ldots \lor B_r^i])}{\sum_{i=1}^{n} V(A^i)} \\
&= \frac{\sum_{i=1}^{n} u^i \cdot V(A^i . B_1^i) + \ldots + \sum_{i=1}^{n} u^i \cdot V(A^i . B_r^i)}{\sum_{i=1}^{n} V(A^i)}
\end{aligned}
\tag{6}
\]

With (4) we have

\[
\frac{\sum_{i=1}^{n} u^i \cdot V(A^i . B_m^i)}{\sum_{i=1}^{n} V(A^i)}
= u_m \cdot \frac{\sum_{i=1}^{n} V(A^i . B_m^i)}{\sum_{i=1}^{n} V(A^i)}
= u_m \cdot F^n(A, B_m)
\tag{7}
\]
Substituting this value in (6) and introducing the new symbols for mean value
and average written on the left side of the following formulas, we obtain

\[ M^n(A, B_m; u_m)^m = \sum_{m=1}^{r} F^n(A, B_m) \cdot u_m \tag{8a} \]

\[ M(A, B_m; u_m)^m = \lim_{n \to \infty} M^n(A, B_m; u_m)^m = \sum_{m=1}^{r} P(A, B_m) \cdot u_m \tag{8b} \]

Because of the completeness of the disjunction the relation holds

\[ \sum_{m=1}^{r} P(A, B_m) = 1 \tag{9} \]

The expressions (8) are called the theoretical definitions, respectively, of the
mean value and of the average. This term expresses the fact that the
definition is achieved by means of a theoretical transformation of the statistical
definition, and thus is constructed in an indirect manner. The definition
reduces the concept of average to the concept of probability. The equivalence
of both kinds of definition follows from the fact that the right side of (8a)
is a transformation of (6); this equivalence is formulated by the equations

\[ M^n(A; u^i)^i = M^n(A, B_m; u_m)^m \tag{10a} \]

\[ M(A; u^i)^i = M(A, B_m; u_m)^m \tag{10b} \]

To simplify the notation, the symbols occurring in (10) are abbreviated by 
omitting the classes A and B. This omission is, in general, without danger, 
since it is usually obvious to which probability classes the mean value is 
referred; in doubtful cases we shall return to the exact notation (10). We 
then obtain from (10) the equations 

\[ M^n(u^i)^i = M^n(u_m)^m = M^n(u) \tag{11a} \]

\[ M(u^i)^i = M(u_m)^m = M(u) \tag{11b} \]

The last notation is further simplified by dropping the subscripts, or super¬ 
scripts, upon which the expression does not depend. This simplification is 
possible because of the equivalence of statistical and theoretical definitions, 
which makes superfluous a distinction between the two definitions. 

The importance of formulas (8) consists in the reduction of the calculation
of an average to the knowledge of certain probabilities, so that the direct
calculation of the average according to (5) is avoidable. Thus in the example of
the die we may calculate the average for all $m$ from (8) with $P(A, B_m) = \frac{1}{6}$ as

\[ M(u) = \tfrac{1}{6} \cdot [1 + 2 + \ldots + 6] = 3\tfrac{1}{2} \]
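
The equivalence of the statistical and the theoretical definition for a finite sequence can be checked exactly with rational arithmetic. The sketch below is an illustration with hypothetical function names and sample values; it assumes a compact sequence A, so that every element is counted.

```python
from collections import Counter
from fractions import Fraction


def statistical_mean(quantities):
    """Direct enumeration: add the individual values u^i and divide by n."""
    return Fraction(sum(quantities), len(quantities))


def theoretical_mean(quantities):
    """Sum over the classes: relative frequency of each amount u_m times u_m."""
    n = len(quantities)
    counts = Counter(quantities)  # how often each amount u_m occurs
    return sum(Fraction(count, n) * u_m for u_m, count in counts.items())


def die_average():
    """Average of a die from the probabilities 1/6 of its six faces."""
    return sum(Fraction(1, 6) * face for face in range(1, 7))


sample = [1, 6, 3, 3, 2, 6, 5]
print(statistical_mean(sample) == theoretical_mean(sample))  # True
print(die_average())  # 7/2
```

Exact fractions are used so that the two definitions agree identically rather than only up to rounding.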
Furthermore, the theoretical definition enables us to define the concept of
average in the formal calculus of probability, for which (5) and the
transformations resting upon the frequency interpretation are not applicable. For
this purpose we write the expression (8b) by the use of the abbreviations (11b),
omitting the relation (8a) concerning the mean value $M^n$. Thus we have

\[ M(u) = \sum_{m=1}^{r} P(A, B_m) \cdot u_m \tag{12} \]

This relation, which is a transcription of (8b), is called the formal definition
of the average. The expression $u_m \cdot P(A, B_m)$ is called mathematical
expectation; correspondingly, the average is also called expectation value.

A relation following from (12) for a constant $k$ may be noted:

\[ M(k \cdot u) = k \cdot M(u) \tag{13} \]

If the disjunction $B_1 \lor \ldots \lor B_r$ has a finite number of terms, the existence
of the average according to (12) is insured if the probabilities $P(A, B_m)$ exist
and (9) holds. These conditions guarantee simultaneously, for the statistical
definition, that there exists the limit $M(u^i)^i$ of the sequence $M^n(u^i)^i$. The
existence of this limit is thus reduced to the existence of the limit of the
frequencies $F^n(A, B_m)$. However, if $r = \infty$, that is, if the disjunction $B_1 \lor B_2 \lor \ldots$
has infinitely many terms, $M(u)$, according to (12), is the limit of a sum.
Even if all the probabilities $P(A, B_m)$ exist and (9) holds, the existence of the
limit $M(u)$ is then not warranted by (12), but is linked to certain conditions
for the $u_m$. Such examples will be given in § 36.

It should be realized that the concept of average constitutes an arbitrary 
definition, or convention, by which a number of quantities are combined in a 
single value so as to achieve an abbreviation. Of course, this single value 
does not represent the totality in a completely adequate manner; it depends 
upon the special case whether the average is at least a suitable substitute for 
the totality. There are cases for which it is more important to know an upper 
limit for the $u_m$ than their average. For instance, the stability of a bridge is
to be adjusted not to the average, but to the maximum, load. The literature
on probability includes such definitions as central value, mode, and quartile,
which are used in a manner similar to “average”, but which are not very
useful for technical reasons. The arbitrary character of the concept of average
is made obvious in applications where the average cannot be interpreted 
directly, as, for instance, in the statement that in a certain country the 
average number of children to a marriage is 2.35. In some cases the average 
is of great practical importance because, by the nature of the amounts con¬ 
sidered, the addition of amounts supplies relevant information. For instance, 
if a merchant averages a profit of M(u) dollars in his business, this means 
that for $n$ transactions his income is greater by $n \cdot M(u)$ dollars than his
expenditures; and only this total sum is of relevance to him. 

Two further examples may be discussed, in which the probability structure 
is more clearly seen. The first is provided by insurance companies. They make
contracts with a large number of clients; their policies stipulate that an
accidental damage $B_m$ occurring with low probability, for instance, damage by
fire, is made good by payment of a large sum $u_m$. For the case $B_l$, by which
we understand $\bar{B}_m$, and which occurs with a high probability, insurance
companies take in a small amount $u_l$, the premium. If the number of clients is
great, the relative frequency may be replaced, in practice, by the probability;
and the average amount of the payments made is given by $M(u)$. For $M(u) = 0$
the premium would be exactly adequate to the risk; but insurance companies
adjust the premium in their favor, so that $M(u)$ is somewhat greater than
zero. With this surplus amount they not only cover the expenses of their 
organization, but accumulate reserve capital for the eventuality that there 
should once occur more accidents than are expected. 
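
The balance between premium and risk can be computed from (12, § 35). The sketch below uses invented figures (a damage probability of 1/1000 and a payment of 50,000) and counts amounts from the company's point of view, premiums as positive and payments as negative; both the figures and the sign convention are assumptions made for the illustration.

```python
from fractions import Fraction


def expectation(probabilities, amounts):
    """Average according to the formal definition: sum of P(A, B_m) * u_m."""
    return sum(p * u for p, u in zip(probabilities, amounts))


def fair_premium(damage_probability, payment):
    """Premium for which the company's average per contract is exactly zero."""
    return damage_probability * payment / (1 - damage_probability)


q = Fraction(1, 1000)   # assumed probability of the damage
payment = 50_000        # assumed sum paid when the damage occurs

premium = fair_premium(q, payment)
print(premium)                                        # 50000/999
print(expectation([q, 1 - q], [-payment, premium]))   # 0
print(expectation([q, 1 - q], [-payment, 60]))        # 497/50
```

With the premium raised from about 50.05 to 60, the average per contract becomes positive; this is the surplus from which expenses and reserves are covered.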

A second example is provided by games of chance, in which certain amounts 
$u_m$ of money, depending on the probability, represent winnings or losses.
A game is called fair if $M(u) = 0$; then the mathematical expectations of
the gamblers are equal. If, in a game with one die, a gambler bets $5 that the 
“6” will show up, the other gambler can demand $1 as winnings if every 
occurrence in which the “6” fails to appear is counted in his favor. The cal¬ 
culation of the chances in complicated games, which played an important 
role in the writings of eighteenth-century mathematicians, is usually linked 
not to the concept of probability but to that of expectation; but this formu¬ 
lation obviously represents only a different mode of speech. 
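
The fairness of the die example can be verified by a short computation. It assumes the reading that the first gambler receives $5 when the “6” appears and pays $1 in each of the five other cases; this reading of the stakes is an interpretation, not a formula stated in the text. With the probabilities 1/6 and 5/6, the formal definition (12) gives for the first gambler

```latex
\[
M(u) = \tfrac{1}{6} \cdot 5 + \tfrac{5}{6} \cdot (-1) = \tfrac{5}{6} - \tfrac{5}{6} = 0
\]
```

The two mathematical expectations, 5 · 1/6 and 1 · 5/6, are equal, which is the condition of a fair game.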

Games played in gambling clubs or casinos are never fair in the sense of 
this definition, since for them $M(u) > 0$ holds in favor of the bank. Thus
in a game with two dice they offer five times the stake for a bet on “any 7”, 
whereas the chances are only one in six to win. Now it is true that the word 
“fair” does not have a moral meaning in the mathematical definition, but 
merely defines the condition of an average balancing of the winnings. But 
it is strange that so many persons participate in gambling that guarantees 
them a loss if it is continued long enough. The flourishing of gambling places 
should be proof of the hopelessness of successful gambling. However, the sug¬ 
gestive power of the amounts that may be won seems to rule out a sound 
reasoning about what will be won. 

The corruption of reasoning is one of the foremost dangers of gambling. 
Patrons of gambling clubs try all sorts of systems in the hope that they can 
outwit the bank, disregarding the fact that gambling machines furnish random 
sequences and thus are immune to attack by a gambler's skill. Various super¬ 
stitions have arisen from gambling; certain numbers are regarded as “lucky”, 
others as “unlucky”. And the attempt to “try his luck” has been dearly
paid for by many a man. 

If the state, or a welfare organization, owns the bank, as in the institution 
of public lotteries, the odds in favor of the bank may seem to be excusable. 
There remains, however, the mischief created by false hopes in the mind of 
the lottery participant. The probability of winning in a lottery is usually 
seriously overestimated. If someone buys a ticket for $5 with the chance of 
winning $20,000, he considers only this pleasant ratio of numbers, forgetting 
that because of this very ratio the winning probability is smaller than 
5/20,000 = 1/4,000, a probability so small that it is regarded as zero in all 
other cases. For instance, the probability of being killed within a year by an 
automobile amounts to about the same value, but no one is concerned in this 
case about such a probability. 1 
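
The two figures compared in this paragraph can be recomputed directly. The sketch below is an illustration, not part of the original argument; it uses only the ticket price, the prize, and the 1947 figures quoted in the footnote, and the variable names are hypothetical.

```python
from fractions import Fraction

# Bound on the winning probability: in a fair game a $5 ticket for a $20,000
# prize would win with probability 5/20,000; with odds favoring the bank the
# actual probability is smaller still.
lottery_bound = Fraction(5, 20_000)

# Figures quoted in the footnote for the United States in 1947.
deaths, population = 32_300, 142_673_000
persons_per_death = population / deaths

print(lottery_bound)             # 1/4000
print(round(persons_per_death))  # 4417
```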

It is true that mathematics cannot supply us with value judgments and 
so cannot determine whether gambling and lotteries are morally good or bad. 
But in pointing out the discrepancies between the gambler’s behavior and 
his mathematical chances the mathematician can contribute his share to 
public education. 

§ 36. Formation of an Average When the Summation 
Is Extended to Infinitely Many Terms 

The summation (12, § 35), when extended over infinitely many terms, need 
not necessarily be convergent even if the condition (9, § 35) remains fulfilled. 
But it is, of course, possible that the summation does converge. These con¬ 
ditions will be illustrated by examples for convergence and opposite examples. 

For instance, we play for a certain result B with the probability p, repeating 
the play until B happens for the first time; this set of plays is called a group. 
If the groups are repeated, what is the average length of a group? In other 
words, after how many elements on the average will B occur? We shall 
assume that 0 < p < 1. 

The different groups, each of which extends to the occurrence of B, may 
be collected in a single sequence of normal character, in which B occurs with 
the probability P(A,B) = p. Every group is of the form 
\(\bar{B}^1 \ldots \bar{B}^{\lambda-1}.B^{\lambda}\), 
so that the whole sequence is divided into such groups of different length \(\lambda\); 
the predecessor of each group is always an element B. We wish to find the 
average length \(M(\lambda)\) of a group. 

1 In 1947 there were 32,300 fatal automobile accidents in the United States, in a 
population of 142,673,000. This is one death through automobile accident in about 
4,400 persons within a year. 
The probability of a group of this kind is written symbolically as 

\[ P(A.B,\ \bar{B}^1 \ldots \bar{B}^{\lambda-1}.B^{\lambda}) \tag{1} \]

That the enumeration by elements is replaced by the counting of groups is 
expressed by the occurrence of B in the first term of the probability expression. 
For in the frequency interpretation the denominator of (1) is given by 
N(A.B), that is, by the frequency of B, since the sequence A is compact. 
The amount \(u_\lambda\), the average of which is to be found, is the group length \(\lambda\), 
that is, we have 

\[ u_\lambda = \lambda \tag{2} \]

According to (12, § 35), the desired average is given by 

\[ M(\lambda) = M(u_\lambda)_\lambda = \sum_{\lambda=1}^{\infty} P(A.B,\ \bar{B}^1 \ldots \bar{B}^{\lambda-1}.B^{\lambda}) \cdot u_\lambda \tag{3} \]

Since the normal sequence is free from aftereffect, the probability (1) can be 
represented by the product of the individual probabilities within the major 
sequence. Thus we have 

\[ P(A.B,\ \bar{B}^1 \ldots \bar{B}^{\lambda-1}.B^{\lambda}) = (1 - p)^{\lambda-1} \cdot p \tag{4} \]

With (2) and (3) we derive 

\[ M(\lambda) = \sum_{\lambda=1}^{\infty} \lambda \cdot (1 - p)^{\lambda-1} \cdot p \tag{5} \]

Formula (5) determines an average by the summation of an infinite number 
of terms; therefore the existence of the average \(M(\lambda)\) must first be proved. 
Proof of convergence, however, can be given. With the abbreviation q for 
(1 - p), we have 

\[ M(\lambda) = p \cdot \sum_{\lambda=1}^{\infty} \lambda \cdot q^{\lambda-1} \tag{6} \]

In the infinite series occurring in this expression, the quotient of two 
successive terms is equal to 

\[ \frac{(\lambda + 1) \cdot q^{\lambda}}{\lambda \cdot q^{\lambda-1}} = \frac{\lambda + 1}{\lambda} \cdot q \tag{7} \]

The expression (7) has, from a certain \(\lambda\) onward, a value < 1, because q < 1; 
and its limit is also < 1, namely, equal to q. According to a well-known 
mathematical theorem (the ratio test) this result provides a sufficient condition 
of convergence for (6). 

The expression (6) can easily be evaluated directly. We have 

\[ \sum_{\lambda=1}^{\infty} \lambda \cdot q^{\lambda-1} = \frac{d}{dq} \Bigl( \sum_{\lambda=1}^{\infty} q^{\lambda} \Bigr) \tag{8} \]
Since we are dealing with power series, we can calculate the sum written on 
the left side by calculating the sum on the right side and then differentiating 
this value. According to the summation formula of a geometrical series, we 
have 

\[ \sum_{\lambda=1}^{\infty} q^{\lambda} = \lim_{\lambda \to \infty} \frac{q^{\lambda+1} - q}{q - 1} = \frac{q}{1 - q} \quad \text{since } 0 < q < 1 \tag{9} \]

Substituting this value in (8), we arrive at 

\[ \sum_{\lambda=1}^{\infty} \lambda \cdot q^{\lambda-1} = \frac{d}{dq} \Bigl( \frac{q}{1 - q} \Bigr) = \frac{1}{(1 - q)^2} \tag{10} \]

Replacing (1 — q) by p and introducing (10) in (6), we derive 

\[ M(\lambda) = \frac{1}{p} \tag{11} \]

This very simple expression answers the question about the average group 
length. For instance, when we throw a die we can expect a certain face to 
turn up on the average after six throws. 
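
The convergence of the sum in (5) toward 1/p can also be observed numerically. The following sketch is an illustration under the stated formulas only; the function name `average_group_length` is hypothetical and the computation uses exact fractions before converting to a decimal for display.

```python
from fractions import Fraction

def average_group_length(p, terms):
    """Partial sum of formula (5): sum of lambda * (1-p)**(lambda-1) * p."""
    q = 1 - p
    return sum(lam * q ** (lam - 1) * p for lam in range(1, terms + 1))

p = Fraction(1, 6)  # probability of a given face of a fair die
for terms in (10, 50, 200):
    # The partial sums increase toward 1/p = 6 as more terms are included.
    print(terms, float(average_group_length(p, terms)))
```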

This result, which was derived purposely by a procedure that, though 
cumbersome, exhibits the general method, is made clear immediately by the 
following considerations. If we throw a die repeatedly until face 6 appears, 
and write the separate groups one after another, a normal die sequence 
obtains. When we denote by \(\lambda^i\) the length of the i-th group and by m the 
number of groups, the expression 

\[ n = \sum_{i=1}^{m} \lambda^i \tag{12} \]

measures the number of all the throws made. But the number m of the groups 
equals the number m of all the events B within the sequence, since every 
group has one and only one B, which stands as its final element. The frequency 
interpretation supplies 

\[ \lim_{n \to \infty} \frac{m}{n} = p \tag{13} \]

The average group length, therefore, is given in the statistical definition by 

\[ M(\lambda) = M(\lambda^i)^i = \lim_{m \to \infty} \frac{1}{m} \sum_{i=1}^{m} \lambda^i = \lim_{n \to \infty} \frac{n}{m} = \frac{1}{p} \tag{14} \]

This relation represents the statistical meaning of (11). For instance, in 600 
throws the “6” will appear about 100 times; the average distance between 
two occurrences of a “6” is therefore 600/100 = 6 = 1/p. 
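
The statistical reading of (14) may be imitated by a simulation. The sketch below is only an illustration: it replaces the die by a pseudorandom generator with a fixed seed, which is an assumption of the example, and the names `group_length`, `m`, and `n` are chosen to mirror the symbols of (12).

```python
import random

def group_length(rng, p):
    """Number of trials up to and including the first occurrence of B."""
    length = 1
    while rng.random() >= p:   # B did not occur; the group continues
        length += 1
    return length

rng = random.Random(0)         # fixed seed so that the run is repeatable
p = 1 / 6
m = 100_000                    # number of groups
n = sum(group_length(rng, p) for _ in range(m))  # total throws, as in (12)
print(n / m)                   # close to 1/p = 6
```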
It is owing to the value \(u_\lambda\) given in (2) that the summation occurring in 
(3) is convergent. An example in which no convergence occurs can easily be 
constructed. If we put 

\[ u_\lambda = 2^{\lambda-1} \tag{15} \]

the sum (3) will not converge for certain values p. For instance, for p = 1/2 
we obtain from (3) with (4) and (15) 

\[ M(u_\lambda)_\lambda = \sum_{\lambda=1}^{\infty} \frac{1}{2^{\lambda}} \cdot 2^{\lambda-1} = \infty \tag{16} \]

This example, for which no finite mean value exists, is spoken of in the litera¬ 
ture as the Petersburg problem because it was originally communicated in the 
publications of the Academy of Petersburg (now Leningrad). It is usually 
presented as follows. Peter flips a coin repeatedly until tails show up. If tails 
appear at the first throw, Peter receives $1 from Paul; if tails occur for the 
first time in the second throw, he receives $2; if tails occur at the third throw, 
he gets $4. Generally speaking, Peter receives \(2^{\lambda-1}\) dollars if tails occur for 
the first time in the \(\lambda\)-th throw. Each game is continued until tails show up. 
The question is: How large a sum may Paul ask Peter to pay as his stake in 
every game? 

The “fair” stake would be given by Peter’s average winnings. Since every 
separate game is identical with the group \(\bar{B}^1 \ldots \bar{B}^{\lambda-1}.B^{\lambda}\), the stake is to 
be calculated by (3) together with (4) and (15), and thus is represented by 
(16). Therefore Paul may ask Peter to bet an infinitely high amount against 
him, to be paid for each individual game. The result may at first seem to be 
paradoxical; no one would feel inclined to risk an infinite amount of money 
if he were offered Peter’s winning chances. We would rather be tempted to 
infer that there must be a finite \(\lambda\) for which tails will occur, so that even in 
the most favorable case Peter's winnings can only be finite, and it would be 
excluded that he could win back his stake in the first game. But the stake 
should be chosen in such a way that even for the first game there should 
exist at least the possibility of winning more than the amount placed. 

Eighteenth-century mathematicians were greatly concerned about this 
paradox. Daniel Bernoulli tried to solve it by revising the concept of mathe¬ 
matical expectation. According to him the so-called moral expectation in¬ 
creases more slowly than the mathematical expectation and remains finite if 
the latter becomes infinite. He was guided by the thought that an increase 
in money is the less valuable for a person, the more money he has—an idea 
that was taken over by the economists as the law of diminishing utility. It is 
a mistake, however, to believe that such a consideration can solve the 
paradox. When we translate the probability statements into frequency state¬ 
ments, the solution is easily given, and the problem loses its paradoxical 
character. 
Assume that a maximum length \(\nu\) of the game is introduced. Only \(\nu\) throws 
are made; if tails have not occurred up to this time, Paul has won the game. 
By cutting off the summation at the value \(\nu\), we calculate from (16) 
\(M(u_\lambda)_\lambda = \nu/2\) and thus have a game free from any paradox. Peter’s average 
winnings in each game are given by \(\nu/2\); therefore he must pay Paul, for each 
game, the amount \(\nu/2\), which seems to be a fair stake. For instance, for \(\nu = 4\), 
Peter cannot win more than $8, and his stake amounts to $2. The result 
seems fair in view of the greater probability of smaller winnings. If \(\nu\) is greater, 
\(\nu/2\) again represents the fair stake; this means that the average winnings of 
Peter are larger. That the expression (16) goes to infinite values means that 
the average winnings of Peter increase continuously with \(\nu\), so that they are 
always equal to \(\nu/2\) and become infinite with \(\nu \to \infty\). If the game is played 
without any limit to \(\nu\), the average gain of Peter will indeed be infinite, and 
the infinitely large stake seems justified. 
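
The value of the truncated sum can be verified term by term. The sketch below is an illustration for a fair coin (p = 1/2); the function name `truncated_stake` is hypothetical, and each term of (16) contributes exactly one half.

```python
from fractions import Fraction

def truncated_stake(nu):
    """Formula (16) cut off at lambda = nu, for a fair coin (p = 1/2)."""
    return sum(Fraction(1, 2) ** lam * 2 ** (lam - 1) for lam in range(1, nu + 1))

for nu in (4, 10, 100):
    # columns: maximum length, fair stake, largest possible winnings
    print(nu, truncated_stake(nu), 2 ** (nu - 1))   # first line: 4 2 8
```

The fair stake grows only in proportion to the permitted length of the game, whereas the largest possible winnings double with every additional throw.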

A clear picture of the situation is obtained when Peter’s winnings are 
reconstructed statistically. When we denote Peter’s gain in the i-th game by 
\(u^i\), Peter’s average winnings up to the m-th game are given by 

\[ M^m(u^i)^i = \frac{1}{m} \sum_{i=1}^{m} u^i \tag{17} \]

That the average supplied by the theoretical definition (16) goes toward 
infinity means that the statistically defined average is not convergent either, 
that is, we have, contrary to (14), 

\[ M(u^i)^i = \lim_{m \to \infty} M^m(u^i)^i = \infty \tag{18} \]


The result has the following simple meaning: although large winnings rarely 
occur for Peter, they influence the average so strongly that, until the next 
large winnings occur, it is not essentially lowered by the many small winnings 
accumulating. Owing to the strong increase of the powers, the amount of the 
winnings increases more strongly than the corresponding probability decreases. 
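
The behavior of the running average (17) may be watched in a simulated series of games. The following sketch is an illustration only; the fair coin is replaced by a seeded pseudorandom generator, which is an assumption of the example, and the name `petersburg_win` is hypothetical. No particular printed values are claimed, since they depend on the generator.

```python
import random

def petersburg_win(rng):
    """Peter's winnings in one game: 2**(lambda-1) if tails first shows at throw lambda."""
    win = 1
    while rng.random() < 0.5:   # heads: the game goes on and the prize doubles
        win *= 2
    return win

rng = random.Random(0)          # fixed seed so that the run is repeatable
total = 0
for i in range(1, 100_001):
    total += petersburg_win(rng)
    if i in (100, 10_000, 100_000):
        print(i, total / i)     # running average of (17) after i games
```

In such runs the average does not settle on a fixed value; it is pushed upward whenever one of the rare long games occurs.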

If we use this consideration for a translation of the statement about the 
infinity of the mathematical expectation into a statement about finite quan¬ 
tities, the solution of the Petersburg paradox may be expressed as follows. 
If Paul accepts the game for any finite amount to be paid by Peter in each game 
as his stake, Paul will lose in the long run when the game is continuously repeated. 
With this formulation the paradox disappears. 
The Petersburg “paradox” is related to a familiar gambling system. A person 
bets $1 on a certain result; if he does not win, he places $2; if he loses 
again, he bets $4, and so on. Thus after every loss he bets twice the amount. 
By this system he must win eventually, because his result must once occur. 
But the gain is not very high; it is equal to the stake of the first bet—$1. 
This follows from the relation 

\[ \sum_{i=0}^{n} 2^i = 2^{n+1} - 1 \tag{19} \]

Such a gambling system is not very lucrative. However, it carries extraordi¬ 
narily high risks; since every gambler possesses only a finite amount of 
money, he may be unable to continue the game and must suffer a complete 
loss. The system is safe only for a gambler with an infinite bank account. 
If the Petersburg game is played with a finite stake for Peter, he can be cer¬ 
tain of winning only if he possesses infinite capital; otherwise he may be 
forced to stop after he has lost all his money. Such an eventuality is not im¬ 
probable, even for a millionaire; so this game, too, offers a high risk. 
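
The limitation imposed by finite capital can be made explicit. The sketch below is an illustration under two assumptions that are not taken from the text: the bets are placed at even odds (p = 1/2), and the capital is a round million dollars. The function name `bets_affordable` is hypothetical.

```python
from fractions import Fraction

def bets_affordable(capital):
    """Number of successive doubled bets (1, 2, 4, ...) a given capital can cover."""
    placed, needed = 0, 1
    while needed <= capital:
        capital -= needed
        needed *= 2
        placed += 1
    return placed

k = bets_affordable(1_000_000)
print(k)                                          # 19
print(sum(2 ** i for i in range(k)), 2 ** k - 1)  # total staked, checked against (19)
print(Fraction(1, 2) ** k)                        # chance of losing all k bets in one run
```

Each successful run gains only $1, so the system must be repeated very often to yield a noticeable profit, and the risk of one ruinous run accumulates with the number of repetitions.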

I cannot share the opinion that the Petersburg problem creates logical 
difficulties for the theory of probability. That special conditions are required 
for the convergence of an infinite summation, such as occurs in (12, § 35), 
seems quite natural. Strangely enough, we find in the literature peculiar ideas 
about the Petersburg problem and faulty attempts to solve it. Thus it is 
argued that it is not permissible to extend each single game until the event 
B occurs, and that the game must be limited by an upper limit \(\nu\) for the group 
length \(\lambda\), since otherwise the individual game would “possibly never end”. 
This conception is erroneous; as soon as a probability is interpreted as the 
limit of a frequency, there must always exist a finite \(\lambda\) such that B occurs. 
Furthermore, it is not the infinity of the summation that makes the sum in 
(16) become infinite; the nonconvergence results rather from the specification 
of the amounts \(u_\lambda\) given in (15). This may be seen from the fact that different 
specifications of \(u_\lambda\), for instance the specification (2), lead, even for infinite 
summation, to a finite mean value, as shown in (11). Thus if Peter always 
receives \(\lambda\) dollars when B occurs in the \(\lambda\)-th throw, his fair stake would be 
1/p = $2, according to (11), and the game would not differ in principle from 

other games, though neither the length of a game nor Peter's winnings are 
subject to an upper limit. 


§ 37. The Dispersion 

The average represents only a first comparatively rough characterization of 
the amounts \(u^i\). The characterization can be made more precise by stating also 
the dispersion. 
Every single amount \(u^i\) deviates from the average M(u) by the amount 

\[ \delta u^i = u^i - M(u) \tag{1} \]

The \(\delta u^i\) are called deviations. If the \(\delta u^i\) are small, the average is a relatively 
good substitute for the totality of values; if they are great, the average 
characterizes the totality insufficiently. Since it is cumbersome to state the totality 
of the \(\delta u^i\), we replace the \(\delta u^i\) by a mean value. 

The average of the \(\delta u^i\) themselves is not a suitable characteristic, however, 
because it vanishes. If we define the possible deviations analogous to (1) 

\[ \delta u_m = u_m - M(u) \tag{2} \]

we obtain, with (9, § 35) and (12, § 35), for the average of the deviations the 
relation 

\[ M(\delta u^i)^i = M(\delta u_m)_m = \sum_{m=1}^{r} P(A,B_m) \cdot \delta u_m = \sum_{m=1}^{r} P(A,B_m) \cdot u_m - M(u) \cdot \sum_{m=1}^{r} P(A,B_m) = 0 \tag{3} \]

The corresponding statement can be asserted, by the way, for the mean 
value \(M^n(\delta^* u^i)^i\) of the deviations \(\delta^* u^i\) from the mean value \(M^n(u^i)^i\). We write 

\[ \delta^* u^i = u^i - M^n(u^i)^i \tag{4} \]

With (8a and 11a, § 35) we arrive at the result 

\[ M^n(\delta^* u^i)^i = M^n(\delta^* u_m)_m = \sum_{m=1}^{r} F^n(A,B_m) \cdot \delta^* u_m = \sum_{m=1}^{r} F^n(A,B_m) \cdot u_m - M^n(u_m)_m \cdot \sum_{m=1}^{r} F^n(A,B_m) = 0 \tag{5} \]

That the average of the \(\delta u^i\) is = 0 results from the fact that the \(\delta u^i\) are 
partly positive and partly negative, and the two sums cancel each other. To 
free ourselves from this consequence we must form a mean value that does 
not depend on the sign of the individual elements \(\delta u^i\). The absolute amount 
\(|\delta u^i|\) might be used for such a mean value, but since this quantity is not easy 
to handle, the value \(\delta^2 u^i\), which is likewise independent of the sign, seems 
preferable. The average of these quantities, for which the symbol \(\Delta^2\) is 
introduced, obtains, according to (12, § 35), as 

\[ \Delta^2(u) = M(\delta^2 u) = \sum_{m=1}^{r} P(A,B_m) \cdot \delta^2 u_m \qquad \Delta(u) = \sqrt{M(\delta^2 u)} = \sqrt{\sum_{m=1}^{r} P(A,B_m) \cdot \delta^2 u_m} \tag{6} \]
This quantity is called the dispersion. More specifically, \(\Delta^2\) is called the 
quadratic dispersion; \(\Delta\) the linear dispersion. The usual name for \(\Delta\) is standard 
deviation; for \(\Delta^2\), variance. In mathematical calculations \(\Delta^2\) is more useful, 
but for all numerical evaluations \(\Delta\) is employed since it is of the same order 
of magnitude as the \(\delta u_m\). The sign of the root in the expression for \(\Delta(u)\) is 
always taken as positive. The choice of the dispersion as defined in (6) for 
the abbreviated characterization of the deviations of the \(u_m\) must be regarded, 
from a logical standpoint, as a convention, comparable to the choice of the 
average. The dispersion, of course, cannot characterize the deviations 
exhaustively. 
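
The definitions of the average and of the dispersion can be applied to a simple case. The sketch below is an illustration only: it takes the faces of a fair die as the amounts, and the function names `average` and `quadratic_dispersion` are hypothetical.

```python
from fractions import Fraction
from math import sqrt

def average(dist):
    """M(u): probability-weighted sum over (amount, probability) pairs."""
    return sum(p * u for u, p in dist)

def quadratic_dispersion(dist):
    """Quadratic dispersion as in (6): average of the squared deviations from M(u)."""
    mean = average(dist)
    return sum(p * (u - mean) ** 2 for u, p in dist)

die = [(face, Fraction(1, 6)) for face in range(1, 7)]
print(average(die))                      # 7/2
print(quadratic_dispersion(die))         # 35/12
print(sqrt(quadratic_dispersion(die)))   # linear dispersion, about 1.71
```

The linear dispersion of about 1.71 is of the same order of magnitude as the individual deviations, which range from 0.5 to 2.5.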

A statistical definition may be added to the theoretical definition (6) of 
the dispersion. Employing the definition (4), we define the mean square of 
deviations \(\Delta^{n2}(u^i)^i\), restricting ourselves for reasons of simplicity to the case 
of a compact sequence A, in the form 

\[ \Delta^{n2}(u^i)^i = \frac{1}{n} \sum_{i=1}^{n} \delta^{*2} u^i = \frac{1}{n} \sum_{m=1}^{r} \delta^{*2} u_m \cdot \sum_{i=1}^{n} V(B_m^i) = \sum_{m=1}^{r} F^n(A,B_m) \cdot [u_m - M^n(u)]^2 \tag{7} \]

The dispersion is defined as the limiting value of the mean square of 
deviations. According to the familiar rules for limits, this limit is constructed when 
we introduce in (7) the limits for \(F^n\) and \(M^n\): 

\[ \Delta^2(u) = \lim_{n \to \infty} \Delta^{n2}(u^i)^i = \sum_{m=1}^{r} P(A,B_m) \cdot [u_m - M(u)]^2 \tag{8} \]

The statistical definition, therefore, leads in (8) to the same value as the 
theoretical definition in (6). 

A few simple mathematical relations concerning the dispersion may be 
noted. First we have, from the definition (6), 

\[ \Delta^2(\kappa \cdot u) = \kappa^2 \cdot \Delta^2(u) \tag{9a} \]

\[ \Delta(\kappa \cdot u) = |\kappa| \cdot \Delta(u) \tag{9b} \]

By \(\kappa\) we denote a constant factor. Second, we compute the change in the 
value of the dispersion arising when the deviations \(\delta u_m\) and \(\delta u^i\), respectively, 
are not referred to the average M(u) but to any other constant value \(u_0\), that 
is, when we employ the deviations \(\delta_0 u_m\) and \(\delta_0 u^i\), which are defined by 

\[ \delta_0 u^i = u^i - u_0 \qquad \delta_0 u_m = u_m - u_0 \tag{10} \]
Using the relation 

\[ \delta_0 u_m = [u_m - M(u)] + [M(u) - u_0] = \delta u_m - \delta u_0 \tag{11} \]

we derive for the dispersion 1 calculated in terms of the deviations (10) 

\[ \begin{aligned} \Delta_0^2(u) &= M(\delta_0^2 u_m)_m = M(\delta u_m - \delta u_0)^2 \\ &= \sum_{m=1}^{r} P(A,B_m) \cdot [\delta^2 u_m - 2\,\delta u_m\,\delta u_0 + \delta^2 u_0] \\ &= M(\delta^2 u_m)_m - 2\,\delta u_0 \cdot M(\delta u_m)_m + \delta^2 u_0 \end{aligned} \]

With (3) and (6) the equation assumes the form 

\[ \Delta_0^2(u) = \Delta^2(u) + [M(u) - u_0]^2 \tag{12} \]


This simple relation may be called the theorem of the shift of the reference point, 
or shift theorem, in a notation introduced by von Mises. The dispersion \(\Delta^2(u)\) 
referred to the average is a minimum when compared to dispersions referred 
to other values \(u_0\). For in (12) there is added to \(\Delta^2(u)\) the always-positive 
term \([M(u) - u_0]^2\), which vanishes only for \(u_0 = M(u)\). 

If we put in (10) \(u_0 = 0\), we have \(\Delta_0^2(u) = M(u^2)\); and solving (12) for \(\Delta^2(u)\) 
we obtain the relation 

\[ \Delta^2(u) = M(u^2) - M^2(u) \tag{13} \]


The relations (12) and (13) hold even for the quantities \(\Delta^{n2}\) and \(M^n\), since 
the derivation given remains valid when we introduce these quantities in 
place of \(\Delta^2\) and M and replace \(P(A,B_m)\) by \(F^n(A,B_m)\). Thus we have 

\[ \Delta_0^{n2}(u) = \Delta^{n2}(u) + [M^n(u) - u_0]^2 \tag{14} \]

\[ \Delta^{n2}(u) = M^n(u^2) - M^{n2}(u) \tag{15} \]
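
The shift theorem (12) and the relation (13) can be confirmed by exact computation. The sketch below is an illustration for the faces of a fair die; the function names `average` and `dispersion_about` and the chosen reference points are hypothetical.

```python
from fractions import Fraction

die = [(face, Fraction(1, 6)) for face in range(1, 7)]

def average(dist, f=lambda u: u):
    """Probability-weighted average of f(u) over (amount, probability) pairs."""
    return sum(p * f(u) for u, p in dist)

def dispersion_about(dist, u0):
    """Quadratic dispersion referred to an arbitrary reference point u0."""
    return average(dist, lambda u: (u - u0) ** 2)

mean = average(die)
var = dispersion_about(die, mean)
for u0 in (0, 2, Fraction(7, 2), 5):
    # shift theorem (12): both sides agree exactly for every reference point
    assert dispersion_about(die, u0) == var + (mean - u0) ** 2
# relation (13): dispersion as average of squares minus square of average
assert var == average(die, lambda u: u * u) - mean ** 2
print(var)   # 35/12
```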


There exists a certain probability that plays a characteristic part in the 
distribution of the amounts \(u^i\). If we envisage a new element of the sequence, 
its value \(u^i\), which is not yet known, depends on which of the \(B_m\) will be 
realized. Now we can ask for the probability \(w_\delta\) that the amount \(u^i\) does not 
deviate by more than \(\delta\) from the average M(u). 

For the probability \(w_\delta\) it is possible to derive a characteristic inequality. 
For this purpose the amounts \(u_m\) are divided into two classes: to class I belong 
all the amounts for which \(|\delta u_m| \le \delta\); to class II belong all the other \(u_m\). 
Formula (6) can be divided in two partial sums by extending the summation in 
the first partial sum over the amounts of class I, and in the second partial 
sum over the amounts of class II: 

\[ \Delta^2(u) = \sum_{m \in \mathrm{I}} P(A,B_m) \cdot \delta^2 u_m + \sum_{m \in \mathrm{II}} P(A,B_m) \cdot \delta^2 u_m \tag{16} \]


1 We write \(M(u + v)^2\) instead of \(M([u + v]^2)\), and \(M^2(u + v)\) for \([M(u + v)]^2\). 
Since none of the terms is negative, the right side of the equation will either 
diminish or remain the same when we drop the partial sum I; the partial 
sum II is further decreased (or left unchanged in the limiting case) when we 
write in it everywhere \(\delta^2\) instead of \(\delta^2 u_m\), for in this sum we always have 
\(|\delta u_m| > \delta\). Therefore we have 

\[ \Delta^2(u) \ge \delta^2 \cdot \sum_{m \in \mathrm{II}} P(A,B_m) \tag{17} \]

But the sum of the probabilities is equal to the probability of \(|\delta u_m|\) being 
greater than \(\delta\), that is, it equals \(1 - w_\delta\). By solving the inequality (17) for 
\(w_\delta\) we thus derive 

\[ w_\delta \ge 1 - \frac{\Delta^2(u)}{\delta^2} \tag{18} \]


This important inequality is called Tchebychev's inequality in honor of the 
mathematician who first formulated it. It establishes a relation between the 
desired probability \(w_\delta\) and the dispersion \(\Delta^2(u)\). Conversely, the dispersion 
\(\Delta^2(u)\) acquires by this inequality a special importance, as it determines a 
lower limit for the probability \(w_\delta\) that an amount \(u^i\) does not deviate from 
the average by more than \(\delta\). But the inequality is of practical interest only 
if the limit \(\delta\) considered is greater than the dispersion, since for \(\delta \le \Delta(u)\) 
the inequality states merely the trivial fact that \(w_\delta \ge 0\). Nevertheless, the 
inequality will be used later for an important purpose. 
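
The inequality (18) can be compared with the exact probability in a simple case. The sketch below is an illustration for the faces of a fair die, whose linear dispersion is about 1.71; the function name `w_delta` and the chosen limits are hypothetical.

```python
from fractions import Fraction

die = [(face, Fraction(1, 6)) for face in range(1, 7)]
mean = sum(p * u for u, p in die)
var = sum(p * (u - mean) ** 2 for u, p in die)

def w_delta(delta):
    """Probability that the amount deviates from M(u) by at most delta."""
    return sum(p for u, p in die if abs(u - mean) <= delta)

for delta in (1, 2, Fraction(5, 2)):
    bound = 1 - var / Fraction(delta) ** 2      # right side of (18)
    print(delta, w_delta(delta), bound, w_delta(delta) >= bound)
```

For the limit 1, which lies below the linear dispersion, the bound is negative and therefore says nothing; for the limits 2 and 5/2 it gives 13/48 and 8/15, against exact probabilities of 2/3 and 1.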


§ 38. Average and Dispersion for a Combination of Events 

We extend the concepts developed to cover the case of several sequences 
that lead to combinations \(B_m.C_l\) and thus to the combination of amounts. 
Assume we have the two disjunctions \(B_1 \lor \ldots \lor B_r\) and \(C_1 \lor \ldots \lor C_t\), 
which belong to different sequences and to which the amounts \(u_1 \ldots u_r\) 
and \(v_1 \ldots v_t\) are coordinated; the amount \(w_{ml}\) may be coordinated to a 
combination \(B_m.C_l\). For this amount \(w_{ml}\) we make the special assumption that 
it is composed additively of the amounts \(u_m\) and \(v_l\), that is, 

\[ w_{ml} = u_m + v_l \tag{1} \]

We calculate first the average M(w), which must be written in the detailed 
notation according to (8b, § 35): 

\[ M(w) = M(A,B_m.C_l,w_{ml})_{ml} = \sum_{m=1}^{r} \sum_{l=1}^{t} P(A,B_m.C_l) \cdot w_{ml} \tag{2} \]
or with (1), when we divide the whole expression into two terms and factorize 
the probability of the logical product according to the two forms of the rule 
of the product: 

\[ \begin{aligned} M(u + v) &= \sum_{m=1}^{r} \sum_{l=1}^{t} P(A,B_m.C_l) \cdot (u_m + v_l) \\ &= \sum_{m=1}^{r} P(A,B_m) \cdot u_m \cdot \sum_{l=1}^{t} P(A.B_m,C_l) + \sum_{l=1}^{t} P(A,C_l) \cdot v_l \cdot \sum_{m=1}^{r} P(A.C_l,B_m) \\ &= M(u) + M(v) \end{aligned} \tag{3} \]

Here we have used the relation holding on account of the completeness of 
the disjunction analogous to (9, § 35): 

\[ \sum_{l=1}^{t} P(A.B_m,C_l) = 1 \qquad \sum_{m=1}^{r} P(A.C_l,B_m) = 1 \tag{4} \]

The average is therefore additive when the amounts are additive. 

For the sake of clarity, (3) may be written with the statement of the 
subscripts or superscripts of summation. We have two ways of writing: 

\[ M(u^i + v^i)^i = M(u^i)^i + M(v^i)^i \tag{3a} \]

\[ M(u_m + v_l)_{ml} = M(u_m)_m + M(v_l)_l \tag{3b} \]

The first notation results from the statistical, the second from the theoretical, 
definition of the average. The plus-sign in the argument of the M-symbol on 
the left side of (3) has only a symbolic significance. If we wish to transform 
its meaning into an arithmetical significance, as on the left sides in (3a) and 
(3b), the two different subscripts m and l must be introduced when we use 
subscripts, as in (3b). 

It is easily seen that additivity corresponding to (3) holds even for the 
mean values \(M^n\); a proof results by writing everywhere \(F^n\) instead of P in 
the derivation given for (3). We have, therefore, 

\[ M^n(u + v) = M^n(u) + M^n(v) \tag{3c} \]

Equation (3) was derived by general methods without any assumption 
concerning the dependence or independence of \(B_m\) and \(C_l\). If we wish to derive 
an analogous result for the dispersion, however, we must make the assumption 
that the events \(B_m\) and \(C_l\) are mutually independent. We write with (3) 

\[ \delta w_{ml} = \delta(u_m + v_l) = u_m + v_l - M(u + v) = u_m - M(u) + v_l - M(v) = \delta u_m + \delta v_l \tag{5} \]
and thus we obtain 

\[ \Delta^2(u + v) = \sum_{m=1}^{r} \sum_{l=1}^{t} P(A,B_m.C_l) \cdot (\delta u_m + \delta v_l)^2 \tag{6} \]

Assuming the independence of \(B_m\) and \(C_l\), we can factorize the probability 
according to the special theorem of multiplication. We therefore have 

\[ \begin{aligned} \Delta^2(u + v) &= \sum_{m=1}^{r} P(A,B_m) \cdot \delta^2 u_m \cdot \sum_{l=1}^{t} P(A,C_l) \\ &+ \sum_{l=1}^{t} P(A,C_l) \cdot \delta^2 v_l \cdot \sum_{m=1}^{r} P(A,B_m) \\ &+ 2 \sum_{m=1}^{r} P(A,B_m) \cdot \delta u_m \cdot \sum_{l=1}^{t} P(A,C_l) \cdot \delta v_l \end{aligned} \tag{7} \]

The last term vanishes because of (3, § 37); with (9, § 35) and (6, § 37) we 
thus derive 

\[ \Delta^2(u + v) = \Delta^2(u) + \Delta^2(v) \tag{8a} \]

\[ \Delta(u + v) = \sqrt{\Delta^2(u) + \Delta^2(v)} \tag{8b} \]

The quadratic dispersion is additive for independent quantities when the 
amounts are additive. 
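
The additivity of the average and of the quadratic dispersion can be confirmed for two independent throws of a die. The sketch below is an illustration only; the distribution of the sum is built by multiplying the probabilities, which is exactly the independence assumption, and the function name `mean_and_dispersion` is hypothetical.

```python
from fractions import Fraction
from itertools import product

die = {face: Fraction(1, 6) for face in range(1, 7)}

def mean_and_dispersion(dist):
    """Average and quadratic dispersion of a {amount: probability} table."""
    m = sum(p * x for x, p in dist.items())
    return m, sum(p * (x - m) ** 2 for x, p in dist.items())

# Distribution of w = u + v for independent throws: the probabilities multiply.
total = {}
for (u, pu), (v, pv) in product(die.items(), repeat=2):
    total[u + v] = total.get(u + v, 0) + pu * pv

m1, d1 = mean_and_dispersion(die)
m2, d2 = mean_and_dispersion(total)
print(m1, d1)   # 7/2 35/12
print(m2, d2)   # 7 35/6
```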

For the sake of clarity we may reintroduce the notation specifying the 
subscripts or superscripts of summation: 

\[ \Delta^2(u^i + v^i)^i = \Delta^2(u^i)^i + \Delta^2(v^i)^i \tag{8c} \]

\[ \Delta^2(u_m + v_l)_{ml} = \Delta^2(u_m)_m + \Delta^2(v_l)_l \tag{8d} \]


Formula (8c) corresponds to the statistical, (8d) to the theoretical, definition 
of the dispersion. 

The relation (8b) expresses the important law of the spreading of the 
dispersion. Because of the inequality 

\[ \sqrt{\Delta^2(u) + \Delta^2(v)} < \Delta(u) + \Delta(v) \tag{9} \]

which is easily derived on the condition that the dispersions do not vanish, 
we have 

\[ \Delta(u + v) < \Delta(u) + \Delta(v) \tag{10} \]


For nonvanishing dispersions, the linear dispersion of the sum is smaller than 
the sum of the linear dispersions; thus the dispersion does not increase at 
the same rate as the amount. The result can be explained as follows. The 
fluctuations of the amounts \(u^i\) and \(v^i\) were assumed to be mutually 
independent; therefore an extreme value of \(u^i\) will rarely coincide with an extreme 
value of \(v^i\). The extreme deviations of one amount will thus be smoothed out 
or even compensated by the behavior of the other amount. 

If, for instance, \(u^i\) and \(v^i\) represent each the number of passengers in a 
streetcar, counted every day at a specific time in a certain street, fluctuations 
in the number will occur. The total number of persons in both streetcars will 
then fluctuate a little more than in each car separately; the deviations, 
however, will not be twice as strong, but will grow according to (8b). Thus, when 
the dispersions \(\Delta(u)\) and \(\Delta(v)\) are assumed to be equal, the total number 
will merely fluctuate \(\sqrt{2}\) times as much as the number in one car. If only 
a few passengers happen to be in one car, then, in general, the same will not 
be the case in the other car; and so the fluctuations are compensated. Formula 
(8b) in combination with (10) is therefore often called the law of the 
compensation of the dispersion. The independence of the fluctuations is essential for 
this result. If a large deviation in the number of persons in one car were 
always linked to a large deviation in the other car, no compensation would 
result, and the fluctuations of the total number of persons would be twice as 
great as those of the individual number. This simple consideration may 
explain why the relations (8) can be derived only if the amounts are assumed 
to be independent. 
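
The contrast between independent and strictly linked fluctuations can be computed for a small invented case. In the sketch below the passenger counts 10 and 30, each with probability 1/2, are purely hypothetical figures chosen for the illustration, as is the function name `linear_dispersion`.

```python
from fractions import Fraction
from itertools import product
from math import sqrt

values = {10: Fraction(1, 2), 30: Fraction(1, 2)}   # hypothetical counts in one car

def linear_dispersion(dist):
    """Linear dispersion (standard deviation) of a {amount: probability} table."""
    m = sum(p * x for x, p in dist.items())
    return sqrt(sum(p * (x - m) ** 2 for x, p in dist.items()))

independent, linked = {}, {}
for (u, pu), (v, pv) in product(values.items(), repeat=2):
    independent[u + v] = independent.get(u + v, 0) + pu * pv
for u, pu in values.items():
    linked[2 * u] = pu          # the second car always matches the first

base = linear_dispersion(values)
print(linear_dispersion(independent) / base)   # about 1.414, the square root of 2
print(linear_dispersion(linked) / base)        # 2.0
```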

The case of extreme dependence can be illustrated through the preceding 
example by reference to the amount of fares received. If each passenger pays 
10 cents, the number of cents paid is 10 times the number of persons; the 
fluctuations in the amounts of money will then be 10 times as strong as the 
fluctuations in the number of persons. No compensation results here, because 
the amounts of money are linked strictly to the number of persons present. 
The example corresponds to the relation formulated in (9b, § 37). 

§ 39. Average and Dispersion in the Lattice 

The principle of the compensation of the dispersion, in contradistinction to 
the multiplication of the dispersion as expressed in (9b, § 37), can be 
illustrated even better when the relations (8, § 38) are transferred from two to a 
greater number of amounts, which belong to further sequences. In order to 
have a suitable notation, we no longer denote the amounts by \(u_m\) and \(v_l\), or 
\(u^i\) and \(v^i\), but write \(u^k_m\) and \(u^{ki}\), thus indicating by the first superscript that 
the amount belongs to the sequence k. The second superscript refers to the 
element i of the sequence k; the subscript indicates the value m of the amount 
belonging to the sequence k. As before, the second superscript can be 
interchanged with the subscript in summation expressions. The number of 
sequences may be n. The sequences constitute a lattice, which is finite in respect 
to the dimension of the first superscript and infinite for the dimension of the 
second superscript. 
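The dispersion of the mean value of the \(u^k_m\) may be calculated, starting 
with the statistical definition of the mean value, which is illustrated with 
the help of the schema 

\[
\begin{array}{cccccc}
u^{11} & u^{12} & u^{13} & \ldots & \bar{u}^1 & \Delta^2(u^1) \\
u^{21} & u^{22} & u^{23} & \ldots & \bar{u}^2 & \Delta^2(u^2) \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
u^{k1} & u^{k2} & u^{k3} & \ldots & \bar{u}^k & \Delta^2(u^k) \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
u^{n1} & u^{n2} & u^{n3} & \ldots & \bar{u}^n & \Delta^2(u^n) \\
\rho^1 & \rho^2 & \rho^3 & \ldots & \bar{\rho} & \Delta^2(\rho)
\end{array}
\tag{1}
\]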


The horizontal rows represent the individual sequences. In the vertical 
direction the mean value \(\rho^i\) of the \(u^{ki}\) is indicated below each column; in the 
horizontal direction each individual sequence has a mean value \(\bar{u}^k\). We thus 
have the notation 

\[ \bar{u}^k = M(u^{ki})^i \qquad \Delta^2(u^k) = M((u^{ki} - \bar{u}^k)^2)^i \tag{2a} \]

\[ \rho^i = M^n(u^{ki})^k = \frac{1}{n} \cdot \sum_{k=1}^{n} u^{ki} \tag{2b} \]

\[ \bar{\rho} = M(\rho^i)^i = M(M^n(u^{ki})^k)^i \qquad \Delta^2(\rho) = M((\rho^i - \bar{\rho})^2)^i \tag{2c} \]


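We begin by calculating the average \(\bar{\rho}\) of the mean values \(\rho^i\). For this 
purpose we make use of the relation 

\[ M\Bigl(\sum_{k=1}^{n} u^{ki}\Bigr)^i = \sum_{k=1}^{n} M(u^{ki})^i \tag{3} \]

which is easily derived from (3, § 38) by generalization. With (13, § 35) we 
then obtain 

\[ M(M^n(u^{ki})^k)^i = \frac{1}{n} \cdot M\Bigl(\sum_{k=1}^{n} u^{ki}\Bigr)^i = \frac{1}{n} \cdot \sum_{k=1}^{n} M(u^{ki})^i = M^n(M(u^{ki})^i)^k \tag{4} \]

or, by dropping the superscripts, 

\[ M(M^n(u)) = M^n(M(u)) \tag{4a} \]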
Average and mean are therefore commutative operations. Using the notation 
employed in schema (1), we write for (4) 

\[ \bar{\rho} = M(\rho^i)^i = M^n(\bar{u}^k)^k \tag{4b} \]

This means that the quantity \(\bar{\rho}\), which represents the average of the last row, 
can simultaneously be constructed as the mean value of the averages \(\bar{u}^1 \ldots \bar{u}^n\) 
presented in the column ending with \(\bar{\rho}\). 

In order to calculate the dispersion of the \(\rho^i\), we regard the mean value 
\(M^n\) as resulting from an addition of amounts. Now we can easily see that the 
law of the addition of the quadratic dispersion (8a, § 38) can be extended 
to apply to further amounts if they are additive and if the sequences are 
completely independent (§ 23). We thus derive the generalization of (8a, § 38): 1 

\[ \Delta^2\Bigl(\sum_{k=1}^{n} u^{ki}\Bigr)^i = \sum_{k=1}^{n} \Delta^2(u^{ki})^i \tag{5} \]


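obtaining, with the help of (9a, § 37) and (2b), 

\[ \Delta^2(M^n(u^{ki})^k)^i = \frac{1}{n^2} \cdot \Delta^2\Bigl(\sum_{k=1}^{n} u^{ki}\Bigr)^i = \frac{1}{n^2} \cdot \sum_{k=1}^{n} \Delta^2(u^{ki})^i = \frac{1}{n} \cdot M^n(\Delta^2(u^{ki})^i)^k \tag{6} \]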


By dropping again the superscripts of summation, we write (6) in the form 

\[ \Delta^2(M^n(u)) = \frac{1}{n} \cdot M^n(\Delta^2(u)) \tag{6a} \]

The quadratic dispersion of the n-fold mean value is only the n-th part of 
the n-fold mean value of the quadratic dispersion. Average and dispersion 
are noncommutative operations; if the operations are interchanged, the factor 
1/n is to be added. In the presentation of schema (1) this result means: the 
quantity \(\Delta^2(\rho)\) does not constitute the mean value of the amounts \(\Delta^2(u^k)\) in 
the corresponding column, but represents only the n-th part of this mean 
value. In this respect the dispersion differs from the average. With the notation 

\[ s^2 = M^n(\Delta^2(u)) \qquad \sigma^2 = \Delta^2(M^n(u)) \tag{6b} \]

1 I should like to draw attention to a possible fallacy. If the amounts \(u^k_m\) are equal for 
the different values of k, that is, if corresponding amounts are the same in the individual 
sequences, we might be inclined to infer from equation (5) the relation \(\Delta^2(n \cdot u) = n \cdot \Delta^2(u)\), 
in contradiction to (9a, § 37). The inference is fallacious, however. From the notation 
(8d, § 38) we see that amounts with different subscripts are to be added; the inference, 
therefore, is impossible. Furthermore, the equality of the possible amounts \(u^k_m\) for different k 
does not allow us to derive the equality of the individual elements \(u^{ki}\) for different k, so that it is 
also impossible to make the inference when the notation of the superscripts is used. 
[Table: Number of Passengers at a Railway Station]

we may write for (6a)

$$\sigma^{2} = \frac{1}{n} \cdot s^{2} \qquad \sigma = \frac{1}{\sqrt{n}} \cdot s \tag{6c}$$

The quantity $s^{2}$ can be regarded as a characterization of the dispersion of
the whole lattice of the $u^{ki}$. The quantity $\sigma^{2}$ characterizes only the dispersion
of the mean values $\bar{u}^{i}$; the fluctuations of the individual $u^{ki}$ within the vertical
columns do not find an expression in the quantity $\sigma^{2}$. But for independent
sequences both quantities are connected by the relation that the quantity $\sigma^{2}$
is the $n$-th part of the quantity $s^{2}$. Correspondingly, the linear dispersion $\sigma$
of the mean value is the $\sqrt{n}$-th part of the linear dispersion $s$ of the whole
lattice, and therefore we speak of the $\sqrt{n}$-law of the dispersion.

This law, which is a generalization of (8b, § 38), is an impressive manifestation
of the compensation of the dispersion. It formulates the assertion that
the mean value of a number of independent quantities fluctuates the less, the
more quantities are concerned. The law possesses an extraordinary importance
in the physical world. It entails the consequence that the phenomena of fluctuation
exhibited by microscopical and submicroscopical particles are not
observable for macroscopic objects. Since the elementary processes are mutually
independent, or at least very nearly so, their fluctuations compensate
one another to a high degree. That we can speak, for instance, of the temperature
of a macroscopic body and determine the quantity objectively by
exact measurements derives from this law: for 1 cc. of a gas the number $n$
of molecules is about $10^{18}$, and even if we assume a dispersion $s$ of 100%
for the kinetic energy of the individual molecule, the dispersion of the mean
value of all these kinetic energies, which corresponds to the temperature,
levels off to $\sigma = \frac{100}{\sqrt{10^{18}}}\,\%$, that is, to one ten-millionth of 1%. The probability
law of the compensation of the dispersion formulates the very law of nature to
which we owe the uniformity of the macroscopic world.

This law may be illustrated by an example from everyday life, which at
the same time makes it evident that the condition of independence is indispensable.
At a railway station the number of passengers occupying each car
was counted for every passing train.² Most of the trains had six cars; only
three of the twenty-four trains counted had eight cars, in which case, instead
of the fifth and sixth cars, the seventh and eighth were substituted for the
statistics. The number of passengers is given in table 2 A (p. 198). Each vertical
column corresponds to one train; each horizontal row represents a certain
car defined in terms of the same location in different trains. The first and the
last car carry more passengers than the others, owing to the fact that they

2 These observations were compiled in 1933 at a station of the Berlin Metropolitan 
Railway, which connects the various parts of the city. 
stop nearer the exits of the station. The cars in the middle have fewer passengers
because they contain second-class compartments, for which the fare
is higher. The trains exhibit a characteristic occupancy, that is, some trains
are fully occupied and others are almost empty; the differences result from
the fact that the trains come from different directions or branch off later.

For this table the quantities $\bar{u}^{k}$ are constructed by taking the mean values
horizontally; the $\bar{u}^{i}$ are constructed as the vertical mean values. By the use
of these values the quantities $s$ and $\sigma$ are calculated. We thus find the value
$\sigma = 9.13$, which is only a little lower than the value $s = 10.91$. The smallness
of the difference is explained by the fact that the relation (6c) is not applicable,
because, owing to the characteristic occupancy of the trains, the horizontal
rows are not mutually independent. Therefore we find only a weak compensation
of the dispersion.

In order to obtain independent horizontal rows, the numbers of each horizontal
row were written on slips of paper and the slips were thoroughly
shuffled for each row separately. The result of the shuffling is given in table
2 B. The horizontal mean values are the same as in table 2 A, since every
horizontal row as a whole remains unchanged. The vertical mean values are
changed essentially, however; it can be seen at a glance that they fluctuate
much less than the values of the individual rows. Thus the extreme values
are given by 31.5 and 11.2, whereas the extreme values of the total lattice
are 59 and 2. Correspondingly, the calculation of $\sigma$ leads to a smaller value
than before; we find $\sigma = 4.68$. Since $s$ is the same as in table 2 A, we would
obtain from (6c) for $n = 6$ the value $\sigma = 4.45$, with which the observed value
$\sigma = 4.68$ is in good agreement. This example³ confirms the law of the compensation
of the dispersion.
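
The shuffling experiment can be imitated numerically. The following Python sketch is an added illustration, not the computation behind tables 2 A and 2 B: the lattice is synthetic (a common effect per column plus independent noise, both assumed to be Gaussian), and the names `lattice_dispersions`, `shuffle_rows`, and `synthetic_lattice` are hypothetical. It evaluates $s$ and $\sigma$ in the sense of (6b) before and after every row is shuffled separately.

```python
import random
import statistics


def lattice_dispersions(lattice):
    """Return (s, sigma) for a lattice given as n rows of equal length.

    s^2     = mean of the quadratic dispersions of the rows
    sigma^2 = quadratic dispersion of the column-wise mean values
    """
    n = len(lattice)
    s_squared = sum(statistics.pvariance(row) for row in lattice) / n
    column_means = [sum(column) / n for column in zip(*lattice)]
    sigma_squared = statistics.pvariance(column_means)
    return s_squared ** 0.5, sigma_squared ** 0.5


def shuffle_rows(lattice, rng):
    """Shuffle every row separately, which destroys the coupling of the rows."""
    shuffled = []
    for row in lattice:
        new_row = list(row)
        rng.shuffle(new_row)
        shuffled.append(new_row)
    return shuffled


def synthetic_lattice(n_rows, n_columns, rng):
    """Rows share a common column effect, so they are not independent."""
    column_effects = [rng.gauss(0.0, 10.0) for _ in range(n_columns)]
    return [
        [30.0 + effect + rng.gauss(0.0, 5.0) for effect in column_effects]
        for _ in range(n_rows)
    ]


rng = random.Random(0)
n_rows = 6
lattice = synthetic_lattice(n_rows, 2000, rng)

s_before, sigma_before = lattice_dispersions(lattice)
s_after, sigma_after = lattice_dispersions(shuffle_rows(lattice, rng))
predicted = s_before / n_rows ** 0.5  # the value expected from (6c)

print(f"dependent rows: s = {s_before:.2f}, sigma = {sigma_before:.2f}")
print(f"shuffled rows:  s = {s_after:.2f}, sigma = {sigma_after:.2f}")
print(f"s / sqrt(n) = {predicted:.2f}")
```

For the dependent rows $\sigma$ stays close to $s$; after the shuffling $s$ is unchanged and $\sigma$ comes out near $s/\sqrt{n}$, which is the behavior described for the railway tables.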

3 The German edition of the book contains at this place a presentation of three kinds of 
dispersion definable for a lattice, a distinction which clarifies a puzzle connected with the 
dispersion of the mean value, namely, the problem of whether the observed dispersion should 
be divided by $n$ or by $n - 1$. I refer the reader who is interested in these problems to the
German edition. 
CONTINUOUS EXTENSIONS OF THE
CONCEPT OF PROBABILITY

§ 40. The Geometrical Interpretation of the Axiom System

The axiom system of the theory of probability was constructed in such a 
manner that it could be treated as a formal system of implicit definitions 
without the use of any interpretation of the concept of probability. An interpretation
was introduced later in the form of the frequency interpretation.
Since all formal systems of axioms have various admissible interpretations, 
we can coordinate interpretations of different kinds to the axiom system of 
probability. An admissible interpretation, which has no connection with the 
frequency interpretation, will now be developed. 

As before, we understand by $A$, $B$, $C$ classes, or sets, but we do not relate
them to sequences; the $x_i$ and $y_i$ are elements of classes for which no definite
order is assumed. However, a one-one correspondence of the $x_i$, $y_i$, and $z_i$,
expressed by the subscript, is assumed as before. As explained in § 7, we
understand by the logical sum of two classes their joint class, or narrower 
couple disjunct; by the logical product, their common class, or narrower 
couple conjunct, depending on whether we are concerned with classes of the 
same or separate domains of elements. The symbols “or” and “and” thus 
assume the meaning that they have in the calculus of classes. 

We simplify the investigation by assuming the classes to be classes of
geometrical points in a plane, thus arriving at a geometrical interpretation of
the axiom system. The classes are then given by geometrical areas. This interpretation
also facilitates the introduction of a measure, which we need for
this interpretation, in that it permits us to understand by the measure M(B)
of a class B the size of the area B. The presentation will be further simplified
by assuming the elements $x_i$, $y_i$, $z_i$, carrying the same subscript, to be identical.
Such a restriction to what was called “internal probability implication” in
(5 and 6, § 9) actually does not impair the generality of the following considerations,
since the extension to pairs of nonidentical elements can easily
be carried through. The classes are illustrated in figure 8; the measure M(A)
is represented by the size of the area denoted by A, and, correspondingly,
the measure M(B) is represented by the size of the area denoted by B. The
common class A.B is indicated by shading; the joint class is given by the
area covered by A and B together, for which the shaded area is counted
but once.


By means of the measure of the classes, the probability coordinated to
both classes is now defined in the form

$$P(A,B) = \frac{M(A.B)}{M(A)} \tag{1}$$

The probability from A to B is defined as the ratio of the measure of the common
part of both classes to the measure of class A. This coordinative definition is
used instead of the frequency interpretation for the concept “probability”
occurring in the axiom system.



[Fig. 8. Geometrical interpretation of probability concept by measure of areas.]

[Fig. 9. Simplified geometrical interpretation of probability concept: $M(A) = 1$, $B$ and $C$ lie within $A$.]


For formulas all containing A in the first term it is advisable to choose a 
measure such that M(A) becomes equal to 1; besides, the conditions are 
simplified by assuming the classes B and C as lying completely within A. 
This simplification is permissible because we use only the parts of B and C 
that are situated inside A. The resulting relations are illustrated in figure 9, 
in which the class A is symbolized by the rectangle. We thus arrive at the 
following interpretation of symbol combinations occurring in the formulas: 


$$\begin{aligned}
P(A,B) &= M(B) \\
P(A,C) &= M(C) \\
P(A,B.C) &= M(B.C) \\
P(A,B \lor C) &= M(B \lor C) \\
P(A.B,C) &= \frac{M(B.C)}{M(B)}
\end{aligned} \tag{2}$$


The geometrical interpretation also illustrates a certain peculiarity of the 
calculus of probability that was mentioned above (§ 21): that the probability
relations existing between two classes B and C relative to a third class A are
determined by three fundamental probabilities, namely, the probabilities
P(A,B), P(A,C), and P(A.B,C), the latter being replaceable by the probability
P(A,B.C). That the two probabilities P(A,B) and P(A,C) are not
sufficient is seen from figure 9; the size of the areas B and C does not determine
the size of the area of their mutual overlapping. The areas can be moved
individually, and only when their degree of coupling is given, for instance, 
by the size of their common area, is their mutual position determined as far 
as necessary for the computation of all their probabilities. Furthermore, the 
geometrical interpretation supplies an instructive illustration of the general 
theorem of addition. The area of the joint class B V C is equal to the sum of 
the areas of B and C diminished by their common area; otherwise the latter 
would be counted twice. 
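
The relations (2) and the remarks on the overlapping of B and C can be checked on a small discrete model. The following Python sketch is an added illustration under simplifying assumptions: the areas are built from unit squares of a 10 x 10 grid, the measure is the number of squares, and the names `measure` and `probability` as well as the particular rectangles are hypothetical choices.

```python
from fractions import Fraction
from itertools import product


def measure(cells):
    """Measure M of a class: the number of unit squares it covers."""
    return Fraction(len(cells))


def probability(first, second):
    """P(first, second) = M(first.second) / M(first), as in definition (1)."""
    if not first:
        raise ValueError("indeterminate: the first class has measure zero")
    return measure(first & second) / measure(first)


# Class A is a 10 x 10 square; B and C are rectangles lying inside A.
A = set(product(range(10), range(10)))
B = {(x, y) for (x, y) in A if x < 6}
C = {(x, y) for (x, y) in A if x >= 4 and y < 5}

p_b = probability(A, B)
p_c = probability(A, C)
p_bc = probability(A, B & C)          # common class B.C
p_b_or_c = probability(A, B | C)      # joint class B v C
p_c_given_b = probability(A & B, C)   # P(A.B, C)

# Moving C without changing its size alters the overlap with B.
C_moved = {(x, y) for (x, y) in A if x < 6 and y < 5}
p_bc_moved = probability(A, B & C_moved)

print(p_b, p_c, p_bc, p_b_or_c, p_c_given_b, p_bc_moved)
```

In this model the general theorem of addition holds as `p_b_or_c == p_b + p_c - p_bc`, while `C_moved` has the same measure as `C` but a different common area with B, so that P(A,B) and P(A,C) alone do not fix P(A,B.C).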

It is easily seen that all the axioms i-iv are satisfied by the geometrical
interpretation. The logical implication $A \supset B$, occurring in axiom ii,1, denotes
the relation of class inclusion according to (21, § 7), which in the geometrical
interpretation means that the area A lies completely within the area B. For
this case the definition (1) leads to the result $P(A,B) = 1$, in agreement with
axiom ii,1. Axiom i asserts that the ratios of the areas become indeterminate
only if the area A is zero. That the other axioms ii,2-iv are fulfilled can easily be
verified by similar considerations. Only the axioms v do not find an interpretation,
since in the geometrical interpretation it is impossible to find an
equivalent of the phase symbols. This impossibility results from the fact 
that the elements of the geometrical point sets are arranged in a continuous 
order, whereas the elements of probability sequences constitute discrete 
sequences. In the geometrical model, therefore, the theory of the order of 
probability sequences cannot be represented. All other probability relations, 
however, find a geometrical interpretation in the model given. 

§ 41. Definition of Probability Sequences with Continuous Attribute

The geometrical interpretation of the concept of probability is connected 
with an extension of the concept of probability sequence now to be explained. 

The extension may be illustrated by the example of shooting at a target. 
The individual hit, that is, the element $x_i$ of the probability sequence, is
characterized by a point on the target; the classes A and B are then areas
of the target. For instance, A may represent the total area of the target;
B may be a certain inner circle around the center of the target (as shown in
fig. 10, p. 206). If we determine statistically the probability P(A,B), that is,
the probability that a shot fired at the target A hits the inner circle B, it is
sufficient to know about each hit whether it lies within the areas A or B.
But if we specify the exact location of the hit on the target we know more
about the element $x_i$ than merely whether it belongs to A or to B. From this
more detailed statement we can see whether the hit is contained in A or in B;
and, moreover, we can determine whether the hit belongs to any other area
of the target that we may wish to consider. The precise location at which
the shot has hit the target is called the attribute of the event $x_i$. This location
is characterized by the use of two coordinates u,v on the target. The manifold
spanned by the coordinates u,v is called the attribute space; in the illustration
it is given by the target. If we know for each $x_i$ its attribute, or its point in
the attribute space, we can determine the probability of a hit for any
arbitrarily chosen area of the target.

The probability for such areas A and B is determined through the frequency
interpretation; we count the number of hits according to (3, § 16), that is, we
take the ratio of the number of hits for the common class A.B to the number of
hits for A. The ratio will usually not be equal to the ratio of the corresponding
geometrical areas. But the considerations of § 40 show how to proceed. It was
explained there that the axioms of the probability theory can be interpreted by
ratios of areas, and we must therefore be able to introduce on the target a
measure function M(G) for areas G such that the measures coordinated to the
areas A, B, A.B determine the probability P(A,B) in the sense of (1, § 40).

[Fig. 10. Probabilities referring to a target. Top view shows target ... function corresponding to the probability, according to (1) and (4).]

It is convenient to assume that the measure function M(G) can be expressed
as the integral of a scalar function $\varphi(u,v)$ taken over the domain G:

$$M(G) = \iint_{G} \varphi(u,v)\,du\,dv \tag{1}$$


This way of writing the measure function is possible when the function
M(G) possesses certain limit properties for $G \to 0$. These conditions can be
expressed in various ways. I shall use the following form: assume that for
every series of boxed-in areas $G_1 \ldots G_n \ldots$, which contract toward a point,
the limit

$$\lim_{n \to \infty} \frac{M(G_n)}{G_n} \tag{2}$$

exists and is a continuous function of the point. By “boxed-in” is meant that
every $G_n$ includes the following area $G_{n+1}$; by “contraction toward a point”,
that $\lim_{n \to \infty} G_n$ is a point. The limit (2) defines the value of the function $\varphi(u,v)$
at the point considered.

Since the measure function M(G) is to supply the probability of the area,
a condition corresponding to (2) must be satisfied by the probability values
of the sequence, that is, we must require that for every series of boxed-in
areas $B_1 \ldots B_n \ldots$, which contract toward a point, the limit

$$\lim_{n \to \infty} \frac{P(A,B_n)}{B_n} \tag{3}$$

exists and is a continuous function of the point. The condition (3) may be
regarded as an axiom that probability sequences with a continuous attribute
must satisfy, in addition to the axioms i-v. This axiom, which may be called
the axiom of continuity, has no analogue for sequences with a finite number
of attribute classes, $B_1 \ldots B_n$, and therefore does not belong in the elementary
calculus of probability.

If the condition (3) is satisfied, the statistical probability may be represented
by the expression

$$P(A,B) = \frac{\iint_{B} \varphi(u,v)\,du\,dv}{\iint_{A} \varphi(u,v)\,du\,dv} \tag{4}$$

For reasons of simplicity, the measure function M is chosen so that $M(A) = 1$;
then (2, § 40) gives

$$P(A,B) = \iint_{B} \varphi(u,v)\,du\,dv \tag{5}$$

If $\varphi$ is introduced as the third coordinate perpendicular to the plane u,v, the
measure function $\varphi$ of the target, according to general experience, assumes the
form of a bell-shaped surface corresponding to the front view of figure 10
(p. 206) and resulting from rotation of the curve around its vertical axis.
According to (1), the measure M(G) coordinated to an area G of the target
is presented by the volume of the cylinder erected above G, the upper boundary
of which is given by the bell-shaped surface. For instance, the measure
coordinated to the area B in the top view is the cylinder that is produced by
a rotation of the shaded area in the front view.
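
The quotient (4) can be evaluated numerically once a definite surface is chosen. The following Python sketch is an added illustration: it assumes, purely for the sake of the example, a rotationally symmetric Gaussian surface with the parameter `spread`, a target A of radius 3, and an inner circle B of radius 1; the function names are hypothetical. The double integrals are approximated by summing the columns over a fine grid, and the result is compared with the closed form that holds for this particular assumed surface.

```python
import math


def density(u, v, spread=1.0):
    """Assumed bell-shaped surface: a rotationally symmetric Gaussian."""
    return math.exp(-(u * u + v * v) / (2.0 * spread ** 2)) / (2.0 * math.pi * spread ** 2)


def measure_of_disc(radius, step=0.01, spread=1.0):
    """Approximate M(G) for a centred disc G by summing columns of base step x step."""
    cells = int(math.ceil(radius / step))
    total = 0.0
    for i in range(-cells, cells):
        u = (i + 0.5) * step
        for j in range(-cells, cells):
            v = (j + 0.5) * step
            if u * u + v * v <= radius * radius:
                total += density(u, v, spread) * step * step
    return total


def probability_inner_circle(inner_radius, target_radius):
    """P(A,B) as the quotient of the two measures, following (4)."""
    return measure_of_disc(inner_radius) / measure_of_disc(target_radius)


numeric = probability_inner_circle(1.0, 3.0)
# Closed form valid only for the Gaussian surface assumed above (spread = 1).
closed_form = (1.0 - math.exp(-0.5)) / (1.0 - math.exp(-4.5))
print(f"numeric = {numeric:.4f}, closed form = {closed_form:.4f}")
```

The division by the measure of A corresponds to the choice $M(A) = 1$: after the surface is rescaled in this way, the numerator alone gives the probability, as in (5).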

A probability represented by an integral over an area, as symbolized in
(4) and (5), is called geometrical probability. The possibility of a geometrical
representation of probabilities results from the considerations given in § 40.
By showing that both the frequency interpretation and the geometrical interpretation
satisfy the axioms of the formal system of probability, that is, are
interpretations of this system, we have demonstrated the isomorphism, or
structural identity, of the two interpretations. Every operation carried out
in terms of probability formulas entails analogous operations in the frequency
interpretation and the geometrical interpretation. Any derived probability
relation is, therefore, symbolized in the geometrical interpretation by those
geometrical relations that have been specified above for the geometrical interpretation
of the probability concept. For instance, the probability $P(A,B \lor C)$ is
determined by the measure of the joint class of B and C when we use the same
measure function $\varphi$ that was introduced in the plane for B and C separately.

The foregoing considerations can easily be generalized for an attribute
space of more than two dimensions. The attribute coordinated to the element
$x_i$ of the probability sequence is then a point in a multidimensional attribute
space u,v, . . .

Probability sequences of a continuous attribute are more general than the
sequences discussed in the preceding chapters, so far as the classes A,B,C are
not given constants of the sequences, but constitute variables. I call such a
sequence a primitive probability sequence; I wish to express by this name that
the sequence represents the root of a number of different probability sequences
of the usual kind, or classified probability sequences, each of which results for
some division of the attribute space. The transition from the primitive to the
classified probability sequence, which is determined by the statement of the
areas A and B, may be called classification. Classification is an operation by
which the statement of the precise attribute coordinated to every element $x_i$
of the sequence is replaced by the weaker statements of whether the element
belongs to A or to B, respectively.
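
The operation of classification may be sketched in Python. The example is an added illustration with hypothetical names (`classify`, `relative_frequency`, `disc`); the primitive sequence is simulated with Gaussian coordinates, which is an assumption made only to have data. One primitive sequence of attribute points yields several classified sequences, one for each choice of the areas A and B, and for each of them the frequency of the common class A.B relative to A can be counted.

```python
import math
import random


def classify(primitive_sequence, in_a, in_b):
    """Replace each attribute point by the weaker statements 'in A' and 'in B'."""
    return [(in_a(point), in_b(point)) for point in primitive_sequence]


def relative_frequency(classified_sequence):
    """Number of elements in A.B divided by the number of elements in A."""
    hits_a = sum(1 for a, _ in classified_sequence if a)
    hits_ab = sum(1 for a, b in classified_sequence if a and b)
    if hits_a == 0:
        raise ValueError("no element belongs to A")
    return hits_ab / hits_a


def disc(radius):
    """Membership test for a circular area around the center of the target."""
    return lambda point: math.hypot(point[0], point[1]) <= radius


rng = random.Random(1)
# Simulated primitive sequence: each element carries its attribute point (u, v).
primitive = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(10000)]

target = disc(3.0)
# Two different classifications of the same primitive sequence.
freq_inner = relative_frequency(classify(primitive, target, disc(1.0)))
freq_right = relative_frequency(classify(primitive, target, lambda p: p[0] > 0.0))
print(f"inner circle: {freq_inner:.3f}, right half: {freq_right:.3f}")
```

The classified sequences retain only the two membership statements; the precise attribute of each element is no longer recoverable from them.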

The possibility of extending the concept of probability sequence to that of 
primitive probability sequence, that is, of a sequence with variable classes, 
derives from the isomorphism existing between the geometrical interpretation 
of probability and the frequency interpretation. To this isomorphism we owe 
the existence of geometrical probabilities. The function $\varphi$ determining the
metric of the attribute space is called a probability function; it may also be
called the distribution (a term introduced by von Mises) of the primitive 
probability sequence, since it determines the distribution of the attributes of 
the sequence. 

As for sequences with coordinated steps of amounts, it is often convenient 
for primitive probability sequences to forget about the event sequences, or 
thing sequences, to which they refer, and to consider directly the sequence 
of number combinations, that is, of attribute points, given by the attribute 
sequence. This point sequence has the character of a probability sequence in 
which the classes A and B, the members of which are attribute points, replace
the classes of events, or things. 

The incorporation of primitive probability sequences into the axiomatic 
calculus of probability offers no difficulties. We can transfer the axioms i-v 
of the probability calculus to primitive probability sequences. Such a transfer
is achieved by the condition that these axioms are to be valid for every classified
sequence that can be derived from the primitive probability sequence. In this 
manner we can incorporate the theory of primitive sequences into the formal 
calculus of probability without employing the frequency interpretation. The 
theory of order can be transferred in the same way; for instance, we define 
the normal primitive probability sequence by the condition that any classified 
sequence derivable from it be a normal sequence in the sense of § 30. 

The only addition to the axiom system that is required for primitive probability
sequences is the postulate of continuity (3). An important consequence
of this axiom, which greatly simplifies operations with an infinite number of
attribute classes, may be explained: the satisfaction of the limit postulate (3)
leads to the commutativity of the limit operation and the probability operator,
that is, to the relation

$$\lim_{n \to \infty} P(A,B_n) = P(A, \lim_{n \to \infty} B_n) \tag{6}$$

The proof of this relation follows from the theorem, derived in the theory 
of integration, according to which the integral is a continuous function of its 
boundaries. This theorem holds even for certain places where the function
$\varphi$ is discontinuous; in fact, we are often concerned with probability functions
that jump discontinuously from one value to another at individual points. 
For such points the postulate (3) is to be formulated somewhat differently 
by the distinction of an approach from one side or the other. 

Although the postulate of continuity (3) is not derivable from the frequency
interpretation, it offers no difficulties so far as the application of the theory
of geometrical probabilities to physical reality is concerned. The postulate 
occupies a place similar to the logical position of the specializing conditions 
of the theory of order, such as the condition of absence of aftereffect. Whether 
the postulate (3) holds can be verified by empirical observation. How such 
verification is achieved will be shown in § 42, in the course of a study of the 
application of geometrical probabilities to practical problems. In this study
of a chapter of the calculus of probability, which is of great practical importance,
the frequency interpretation will always be used.

§ 42. Empirical Determination of a Probability Function 

It will now be explained how a probability function is found by statistical
observations. Retaining the example of the target, imagine its area to be
covered by a rectangular system of coordinates with the distances $du$ and $dv$,
respectively (fig. 11); each rectangular area may be determined by the statement
of the coordinates $u_m, v_l$ belonging to its lower left corner. By counting
the number of hits for each rectangle we determine statistically the probability
$w_{ml}$ of hitting the rectangular area $u_m v_l$. If $x_i$ is an individual shot,
then $x_i$ belongs to the class $B_{ml}$ if $x_i$ makes a hit inside the rectangle $u_m v_l$, and
$x_i$ belongs to the class A if $x_i$ hits the target at all. We have then (for the
N-notation see § 16)

$$w_{ml} = P(A,B_{ml}) = \lim_{n \to \infty} \frac{N^{n}(A.B_{ml})}{N^{n}(A)} \tag{1}$$


In practice we cannot reach the limit, of course, but must break off after
some large number of shots. Theoretically, however, the frequency interpretation
requires the existence of the limit (1) as the condition that enables us to
speak of the probability $w_{ml}$. The transition from the observed number of hits
to the limit represents the usual inductive inference, without which practical
applications of the calculus of probability cannot be constructed (see § 17).
After the existence and the value of the limit have been inductively ascertained
for one rectangle, corresponding statements are made for the other rectangles.
For a none-too-large number of rectangles this statement can be tested as for
the first rectangle; the extension to further rectangles represents another
inductive inference.

[Fig. 11. ... on a target.]

Above the rectangle $u_m v_l$ we now draw as the third, or vertical, coordinate
a quantity $\varphi^{*}_{ml}$ defined by

$$\varphi^{*}_{ml} \cdot du_m \cdot dv_l = w_{ml} \tag{2}$$

Thus we erect on top of the area $u_m v_l$ a rectangular column with a volume
equal to $w_{ml}$. The totality of the columns, which has a staircase-shaped surface,
is called a histogram.

If the same construction is repeated for smaller rectangles $du_m\,dv_l$, the probabilities
$w_{ml}$ become smaller, because a smaller area is hit more rarely; but
the ordinates $\varphi^{*}_{ml}$ need not change much thereby, since the smaller frequency
is expressed by the decrease of the width of the column. Assume that the
height of the column remains virtually constant, or, more precisely, that
when the construction is repeated for smaller and smaller rectangles, there
exists at every place the limit

$$\lim_{\substack{du_m \to 0 \\ dv_l \to 0}} \frac{w_{ml}}{du_m\,dv_l} = \varphi(u_m, v_l) \tag{3}$$

and that this limit is a continuous function of the point $u_m, v_l$. Of course, this
assumption, including the existence of the limit (3), is not capable of a strict 
proof; but it can be inferred on the basis of the observational material by 
means of an inductive inference, an extrapolation of the observed regularities, 
corresponding to the inductive determination of the limit of the frequency
in an infinite sequence (see § 17). We thus arrive at an empirical verification
of the postulate (3, § 41) that was introduced for primitive probability sequences.
This postulate does not impose general conditions on the physical
world, but is treated like all other specializing conditions: we restrict its 
application to empirical material that satisfies the postulate. 

The introduction of the condition (3) allows us to replace the staircase-shaped
surface by a smooth surface $\varphi(u,v)$; for finite areas B the condition
means that the probability is determined by the double integral

$$P(A,B) = \iint_{B} \varphi(u,v)\,du\,dv \tag{4}$$


This means that the probability relations are characterized by a probability
function $\varphi$ in the sense previously specified. In the example of the target,
the function $\varphi(u,v)$ is represented by the bell-shaped surface drawn in the
front view of figure 10 (p. 206).

Although the relation (3) is not strictly verifiable, because the limit of an 
infinite sequence is not accessible to observation, the practical procedure 
reflects the double transition to a limit expressed in the two relations (1) 
and (3). The transition to smaller rectangles cannot be continued until the 
number of observed hits in each rectangle is sufficiently large; otherwise we 
would arrive at a wrong picture of the distribution. Theoretically speaking, 
this means that the two transitions to a limit expressed in (1) and (3) are 
not commutative. 
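
The construction of the histogram according to (2), and the danger of refining the rectangles too early, can be illustrated as follows. The Python sketch is an added example with hypothetical names (`histogram_density`, `empty_fraction`); the shots are simulated with Gaussian coordinates, which is an assumption and not observational material. The column heights are the relative frequencies divided by the area of the rectangle, so that the total volume of the columns equals 1.

```python
import random


def histogram_density(points, lower, upper, bins):
    """Column heights w / (du * dv) over the square target [lower, upper) x [lower, upper)."""
    width = (upper - lower) / bins
    counts = [[0] * bins for _ in range(bins)]
    for u, v in points:
        m = int((u - lower) // width)
        l = int((v - lower) // width)
        if 0 <= m < bins and 0 <= l < bins:  # shots outside the target are ignored
            counts[m][l] += 1
    hits_on_target = sum(map(sum, counts))
    if hits_on_target == 0:
        raise ValueError("no shot hit the target")
    heights = [[c / hits_on_target / (width * width) for c in row] for row in counts]
    return heights, width


def empty_fraction(heights):
    """Share of rectangles that received no hit at all."""
    cells = [h for row in heights for h in row]
    return sum(1 for h in cells if h == 0.0) / len(cells)


rng = random.Random(2)
shots = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(2000)]

coarse, coarse_width = histogram_density(shots, -3.0, 3.0, 8)
fine, fine_width = histogram_density(shots, -3.0, 3.0, 80)

coarse_volume = sum(map(sum, coarse)) * coarse_width ** 2
fine_volume = sum(map(sum, fine)) * fine_width ** 2
print(f"volumes: {coarse_volume:.6f}, {fine_volume:.6f}")
print(f"empty rectangles: coarse {empty_fraction(coarse):.2f}, fine {empty_fraction(fine):.2f}")
```

With the same 2000 shots the fine division leaves most rectangles without any hit, so that its staircase surface gives a wrong picture of the distribution; the number of observations must grow before the rectangles are made smaller.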


From (3) and (4) we see that $\varphi$ does not possess the character of probability,
which, rather, can be ascribed only to the integral over $\varphi$; the function $\varphi$ is
therefore called a probability density. In particular, $\varphi(u,v)$ is called a two-dimensional
probability density, since only a double integration leads to a probability.
Correspondingly, we speak of a one-dimensional probability density for a one-dimensional
attribute space and, similarly, of multidimensional probability
densities for attribute spaces of more than two dimensions. For small areas
$du\,dv$ we can regard the value

$$\varphi(u,v)\,du\,dv \tag{5}$$

as the approximate value of the probability. This notation becomes strictly
correct whenever we proceed from such expressions to integrations, and we
shall therefore use it for the sake of brevity in the following discussions. The
notation clearly illustrates the fact that the probability goes toward zero if
the area $du\,dv$ goes to zero; the probability of hitting a precisely prescribed
point, therefore, is 0. It is this very fact that makes $\varphi$ a density, as distinct
from a probability.

It is a general property of all probability functions that they satisfy the
relation

$$\iint \varphi(u,v)\,du\,dv = 1 \tag{6}$$

This integral represents the probability that the attribute point is located at
some point within the attribute space, a probability which is 1. We call
(6) the condition of normalization for probability functions.

The empirical determination of a probability function may be illustrated
by an example. Table 3 (p. 213) presents the result of measurements of the
height of American recruits.¹ A total of 25,878 individuals were measured.
A division by intervals of one inch was made, and the number of persons
falling into each interval was recorded. The last column of the table will be
discussed later.

[Fig. 12. Frequency distribution found in measurements of height of 25,878 American recruits.]

The table is represented graphically in figure 12. We are dealing with a 
one-dimensional attribute space; so only two axes are required. The values u 
of the height are given by the abscissa, and as the interval du the value of 
one inch is chosen, in correspondence with the table. Above every interval 
I have drawn a rectangle the height of which is chosen so that the area of the 
rectangle equals the probability that a certain person falls into this interval. 
The probability is calculated by dividing the corresponding number in the 
table by the total number 25,878. If the scale of the abscissa is chosen so 
that $du = 1$, the height of each rectangle is equal to this quotient because
$\varphi(u)\,du$ represents the considered probability. I have used this scale for the
ordinates inscribed on the left side of the diagram. One could also select for
the ordinate a scale that makes the height of each rectangle directly equal to
the corresponding number of individuals as given in the table. Then the length
1/25,878 must be ascribed to the interval $du$. This scale is employed for the
notation of the ordinates given on the right side of the diagram.

1 This example is taken from Karl Pearson, The Chances of Death . . . (London, 1897), 
p. 276. 
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In this manner the histogram of figure 12 is obtained, which has the shape 
of a staircase. If the construction is repeated for a finer division by intervals, 
the steps become narrower; and the procedure would define a continuous 


TABLE 3 

Height of American Recruits 

Height        Observed number     Theoretical 
(inches)      of recruits         number 

Below 55              4                 1 
55-56                 1                 0 
56-57                 3                 1 
57-58                 7                 2 
58-59                 6                 7 
59-60                10                29 
60-61                15                85 
61-62                50               224 
62-63               526               535 
63-64              1237              1065 
64-65              1947              1854 
65-66              3019              2788 
66-67              3475              3582 
67-68              4054              3980 
68-69              3631              3818 
69-70              3133              3126 
70-71              2075              2221 
71-72              1485              1350 
72-73               680               703 
73-74               343               325 
74-75               118               125 
75-76                42                42 
76-77                 9                12 
77-78                 6                 3 
78-79                 2                 1 

curve for the limit du → 0 if the number of individuals measured were 
sufficiently increased with the decrease of the interval. For practical reasons, 
of course, the limit cannot be reached, but it can be constructed by 
extrapolation. The continuous limiting curve in the diagram has been constructed thus. 
We can imagine the curve as being drawn “intuitively”; another procedure 
will be explained later. 

The question arises whether a simple analytic expression can be found 
for such a symmetrical bell-shaped curve. Gauss has answered this question 
affirmatively by showing that the expression 


$$\varphi(u) = \frac{h}{\sqrt{\pi}}\, e^{-h^2 u^2} \tag{7}$$
represents a curve of the type explained; it is usually called the normal curve. 
The choice of the parameter h is left open; thus a more or less steep form can 
be given to the curve. In figure 13 several such curves are drawn for different 
values of h; a larger value of h means a steeper curve. Therefore h is called 
the measure of precision; a larger value of h means that all values are crowded 
more closely around the middle value. For u = 0 the function assumes the value h/√π; 
this is the maximum value of φ, whereas for positive or negative u the amount 
of φ decreases and finally goes asymptotically toward 0. 

Fig. 13. Normal curves for different values of h, according to (7). 

The factor of normalization h/√π is put before the exponential expression in 
order to satisfy the condition of normalization (6): 

$$\int_{-\infty}^{+\infty} \frac{h}{\sqrt{\pi}}\, e^{-h^2 u^2}\, du = 1 \tag{8}$$

This result follows from the relation, known from the theory of the exponential 
function, 

$$\int_{-\infty}^{+\infty} e^{-x^2}\, dx = \sqrt{\pi} \tag{9}$$
This definite integral possesses a simple value, but the integration cannot 
be carried through explicitly for arbitrarily chosen limits. Tables have been 
constructed, however, for numerical calculations. 

If the peak of the normal curve is not situated at u = 0 but at u = u_0, 
we obtain, instead of (7), the expression 


$$\varphi(u) = \frac{h}{\sqrt{\pi}}\, e^{-h^2 (u - u_0)^2} \tag{10}$$


For this function, too, (8) is fulfilled. In (10) the Gauss function has a form 
that can be made to fit an observed distribution. We choose for u_0 the mean 
value of all measurements; in the example about the recruits the mean value 
is equal to 67.701 in. The parameter h is to be selected so that the curve 
follows the observed distribution as smoothly as possible. A procedure 
achieving this purpose will be explained in § 43. For the measurement of 
recruits, h = 0.274 (in the scale given for the ordinate on the left side of 
fig. 12). With the help of the values u_0 and h we can calculate from (10), 
conversely, the value φ(u) belonging to each interval du; by comparing the 
calculated values with the observed values we are able to judge how well the 
function (10) fits the problem. The calculated values of φ(u) are listed under 
the heading “Theoretical number” in the last column of table 3. (They refer 
to the scale indicated for the ordinate on the right side of fig. 12, p. 212.) 
We recognize the excellent correspondence, which shows that the function 
(10) fits the observed curve very well. 
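
The comparison just described can be retraced numerically. The following is a minimal sketch, not the procedure used for the table: it assumes that a theoretical number is obtained by integrating (10) over a one-inch interval and multiplying by the total 25,878. The function names are illustrative, and because the table was computed by hand the results agree with it only approximately.

```python
import math

# Values quoted in the text for the recruits example.
N_RECRUITS = 25_878   # total number of measured individuals
U0 = 67.701           # mean height in inches
H = 0.274             # measure of precision


def normal_density(u, u0=U0, h=H):
    """Gauss function (10): h/sqrt(pi) * exp(-h^2 (u - u0)^2)."""
    return h / math.sqrt(math.pi) * math.exp(-(h * (u - u0)) ** 2)


def interval_probability(a, b, u0=U0, h=H):
    """Integral of (10) from a to b, expressed through the error function."""
    return 0.5 * (math.erf(h * (b - u0)) - math.erf(h * (a - u0)))


def theoretical_count(a, b):
    """Expected number of recruits with height between a and b inches."""
    return N_RECRUITS * interval_probability(a, b)


# Observed numbers copied from table 3 for three central intervals.
observed = {(66, 67): 3475, (67, 68): 4054, (68, 69): 3631}
for (a, b), n_obs in observed.items():
    print(f"{a}-{b} in.: observed {n_obs}, normal curve {theoretical_count(a, b):.0f}")
```

The error function is used here only because the integral of (10) between finite limits has no elementary closed form, which is the reason the text refers to tables.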

That it is possible to represent a distribution, as given in the example, 
by an expression like (10) must be regarded as an empirical fact. 
Unfortunately, some authors have believed all sorts of secrets to be hidden in the 
normal distribution; they have even regarded the normal curve as a 
mysterious law of all natural phenomena. But the normal curve is certainly not 
a law for all objects; there are distributions that follow the Gauss law and 
others that do not. It has to be tested for every set of observational data 
whether it satisfies the normal distribution. 

The great practical importance of the normal distribution consists in the 
fact that we find many applications for it; we are thus in a position to express 
various sets of statistical material in terms of the same simple analytic 
expression, the individual form of each set being characterized by only two 
parameters, namely, the quantities h and u_0 (for this point see p. 221). 

For the two-dimensional attribute space, also, normal distributions can be 
defined. They are characterized 2 by the analytic expression 

$$\varphi(u,v) = \frac{h_1 h_2}{\pi}\, e^{-(h_1^2 u^2 + h_2^2 v^2)} \tag{11}$$


2 In the general case the exponent contains a positive-definite quadratic form; but it can 
always be changed into (11) by transforming the main axes. 
Fig. 14. Hertzsprung-Russell diagram of fixed stars (abscissa: temperature, decreasing; ordinate: absolute brightness, increasing). 

For the case of rotational symmetry, in particular, we have h_1 = h_2; such a 
“Gauss bell” is produced by the rotation of one of the curves of figure 13 
around the vertical axis. The distribution of shots hitting a target (according 
to fig. 10, p. 206) is of this type. The extension of these considerations to 
attribute spaces of more than two dimensions offers no difficulty. 

A second example concerns an instance in which the distribution does not 
have the character of a normal curve. In recent years the statistics of stars 
have assumed major importance in astronomy. E. Hertzsprung and H. N. 
Russell, in particular, have compiled statistics of fixed stars in which the 
stars are arranged according to temperature and brightness. We are dealing 
with a two-dimensional attribute space. The temperature is determined by 
the spectral type of the star, which is denoted by the letters A, B, . . . ; 
intermediate values are indicated by the addition of a subscript varying from 
0 to 9 such that A_0, B_0, . . . represent the pure types. This parameter, denoted 
by u, is drawn as abscissa in figure 14, for which the temperature in degrees 
centigrade has also been indicated. As ordinate we use the absolute brightness, 
the brightness that the star would exhibit for a certain normal distance. In 
figure 14, which is taken from an article by H. Mineur, 3 a dot is made for 
each star at the corresponding place in the coordinate system, so that the 
density of the dots represents a measure of the relative frequency at each 
place. The total number of stars represented in the diagram is 3,360. 

By analogy with the one-dimensional case we can erect above each small 
rectangle ds = du dv of the u,v-plane a prismatic column of such a height that 
its volume is equal to the corresponding relative frequency. The 
staircase-shaped histogram converges toward a continuous surface for the limiting 
case ds = 0. The shape of these “probability mountains” may be recognized 
from figure 14 if the blackening given by the density of the dots is regarded 
as a measure of the height. A presentation in contour lines, as used for 
mountain maps, can also be employed; a presentation of this kind is given in 
figure 15. For this purpose figure 14 was covered with a net of little squares 
ds. The side of the square, compared to the ordinate v, was dv = § magnitude 
(of the stars); compared to the abscissa u , it was du = l distance between 
two spectral types. Since the relation between distance of spectral types and 
temperature is not linear, du represents different intervals of temperature, 
varying with the place. A square ds of this size is drawn in the lower left 
corner of figure 15. The number of stars for each square was counted, and the 
contour lines were drawn according to the result of the enumeration. The 
numbers written on the contour lines are the absolute numbers thus obtained, 
so that they define for the square ds a scale in which ds = 1/3,360, analogous to 
the scale indicated on the right side of figure 12 (p. 212). 

3 Bull. Soc. Astron. de France, Vol. 45 (1931), pp. 4-20. 




In figure 15 the shape of the probability mountains is clearly visible; from 
the upper left corner to the lower right corner runs a continuous ridge, and 
at the upper right corner lie two further crests, relatively isolated. The shape 
of the probability surface expresses a peculiar law of the stellar system, the 
deeper significance of which has not yet been completely understood. 
Astronomers, however, have ventured to make the inference that the line of the 
ridge running from the isolated crests at the upper right to the upper left 
and thence to the lower right corner represents a picture of the life history 
of an individual star. This is an inference in terms of a probability lattice; 
it is assumed that the life lines of the individual stars, conceived as 
probability sequences, form a homogeneous lattice, so that a vertical cross section 
at a time t = const. is identical with a horizontal cross section in the t-direction. 


§ 43. The One-Dimensional Attribute Space 

The special case of an attribute space having only one dimension may be 
regarded as a generalization of the probability sequence with coordinated 
amounts. The generalization consists in that the amounts u_i are no longer 
restricted to certain fixed steps of amounts u_m, but may assume any values 
u located on a continuous scale, for which there need not exist an upper or a 
lower limit. Because of this relation it is possible to transfer to the 
one-dimensional attribute space the formulas developed previously for average 
and dispersion. 

The statistical definitions of average and dispersion are the same as in the 
case treated above, since the expressions 



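$$M^n(u^i)_i = \frac{1}{n}\sum_{i=1}^{n} u^i \tag{1a}$$

$$M(u^i)_i = \lim_{n\to\infty} M^n(u^i)_i \tag{1b}$$

$$\Delta^{n2}(u^i)_i = M^n\big([u^i - M(u)]^2\big)_i \tag{2a}$$

$$\Delta^2(u^i)_i = \lim_{n\to\infty} \Delta^{n2}(u^i)_i \tag{2b}$$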


can be applied to a continuous variable u as well as to quantities u_m varying 
by steps. Such an application is possible because the summation by the 
superscript occurring in these expressions refers to the elements of the sequence 
and is therefore not affected by the transition to a continuous variability of 
the u. It is different, however, with the theoretical definition. Since in this 
definition we take the sum of steps of amounts, the summation must be 
replaced in the continuous case by an integration. In a manner similar to that 
used for the probability, we assume a division by steps of amounts du_1, du_2, 
. . . du_m, . . . of a small finite width; the probability of hitting such an 
interval is then approximately equal to φ(u_m)du_m, and instead of (12, § 35) 
we write 

$$M(u) = \lim_{du_m \to 0} \sum_{m=-\infty}^{+\infty} \varphi(u_m)\, du_m \cdot u_m = \int_{-\infty}^{+\infty} u \cdot \varphi(u)\, du \tag{3}$$

The proof that this expression is equal to (1b) is easily given by the use of 
the relation derived above for amounts varying by steps in combination with 
certain familiar methods of the differential calculus. This proof is based upon 
the simple inequality 

$$\sum_{m=0}^{\infty} \varphi^*(u_m)\,du_m \cdot u_m + \sum_{m=-\infty}^{-1} \varphi^*(u_m)\,du_m \cdot u_m \;\le\; M(u) \;\le\; \sum_{m=0}^{\infty} \varphi^*(u_m)\,du_m \cdot u_{m+1} + \sum_{m=-\infty}^{-1} \varphi^*(u_m)\,du_m \cdot u_{m+1} \tag{3'}$$

in which we have put m = 0 for u_m = 0, that is, u_0 = 0; φ* has the meaning 
introduced in (2, § 42). Equation (3) follows from (3') because the 
expressions written on the left and right sides assume the same value for the limit 
du_m → 0 and become equal to the integral on the right side of (3). According 
to a previous remark (p. 181), the convergence of the expressions (3) and 
(3') must be regarded as an additional assumption, which is not guaranteed 
by (6, § 42). Corresponding assumptions are to be made for the definition 
of similar quantities given in the following considerations. 

Similarly, we obtain the theoretical definition of the dispersion by analogy 
with (6, § 37): 

$$\Delta^2(u) = \int_{-\infty}^{+\infty} \delta^2 u \cdot \varphi(u)\, du \qquad \delta u = u - M(u) \tag{4}$$


Furthermore, all relations derived previously remain valid in a similar 
translation. Thus we have, in analogy to (3, § 37), 

$$\int_{-\infty}^{+\infty} \delta u \cdot \varphi(u)\, du = \int_{-\infty}^{+\infty} u \cdot \varphi(u)\, du - M(u) \cdot \int_{-\infty}^{+\infty} \varphi(u)\, du = 0 \tag{5}$$

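By means of this relation we can easily derive the theorem of the shift of the 
reference point, analogous to (12, § 37): 

$$\delta_0 u = u - u_0 \qquad \delta u_0 = M(u) - u_0$$

$$\int_{-\infty}^{+\infty} \delta_0^2 u \cdot \varphi(u)\, du = \int_{-\infty}^{+\infty} [\delta^2 u + 2\,\delta u\,\delta u_0 + \delta^2 u_0]\, \varphi(u)\, du = \Delta^2(u) + [M(u) - u_0]^2 \tag{6}$$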
Choosing the zero point u = 0 as point of reference u_0 and solving for Δ²(u), 
we obtain the relation analogous to (13, § 37): 

$$\Delta^2(u) = \int_{-\infty}^{+\infty} u^2 \cdot \varphi(u)\, du - M^2(u) \tag{7}$$

The last form is a convenient transformation of the definition (4) of the 
dispersion. 
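
The relations (3), (4), (5), and (7) can be checked numerically for any density whose integrals converge. The sketch below is only an illustration: the asymmetric density u·e^(-u), the midpoint rule, and the cutoff at u = 60 are assumptions chosen for convenience and do not come from the text.

```python
import math


def integrate(f, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of f from a to b."""
    width = (b - a) / steps
    return sum(f(a + (k + 0.5) * width) for k in range(steps)) * width


def phi(u):
    """Hypothetical asymmetric density for illustration: u * exp(-u) for u >= 0."""
    return u * math.exp(-u) if u >= 0 else 0.0


A, B = 0.0, 60.0  # beyond u = 60 the density is negligible

norm = integrate(phi, A, B)                                        # normalization
mean = integrate(lambda u: u * phi(u), A, B)                       # (3)
disp_def = integrate(lambda u: (u - mean) ** 2 * phi(u), A, B)     # (4)
dev_moment = integrate(lambda u: (u - mean) * phi(u), A, B)        # (5)
disp_short = integrate(lambda u: u * u * phi(u), A, B) - mean**2   # (7)

print(norm, mean, disp_def, disp_short, dev_moment)
```

For this density both the mean and the squared dispersion equal 2, so the two forms (4) and (7) can be compared directly against a known value.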

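It is of special interest to determine the mean value and the dispersion for 
the normal distribution. We have with (9, § 42) and the substitution 
x = h(u − u_0), 

$$M(u) = \frac{h}{\sqrt{\pi}} \int_{-\infty}^{+\infty} u \cdot e^{-h^2(u-u_0)^2}\, du = \frac{1}{\sqrt{\pi}} \cdot u_0 \int_{-\infty}^{+\infty} e^{-x^2}\, dx + \frac{1}{h \cdot \sqrt{\pi}} \int_{-\infty}^{+\infty} x \cdot e^{-x^2}\, dx = u_0 \tag{8}$$

$$\Delta^2(u) = \frac{h}{\sqrt{\pi}} \int_{-\infty}^{+\infty} (u - u_0)^2 \cdot e^{-h^2(u-u_0)^2}\, du = \frac{1}{h^2 \cdot \sqrt{\pi}} \int_{-\infty}^{+\infty} x^2 \cdot e^{-x^2}\, dx = \frac{1}{2h^2} \tag{9a}$$

$$\Delta(u) = \frac{1}{h \cdot \sqrt{2}} \tag{9b}$$

For these equations we have used the relations, known from the theory of 
the exponential function, 

$$\int_{-\infty}^{+\infty} x \cdot e^{-x^2}\, dx = 0 \qquad \int_{-\infty}^{+\infty} x^2 \cdot e^{-x^2}\, dx = \frac{\sqrt{\pi}}{2} \tag{10}$$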


The value u_0 of u at which the peak of the Gaussian bell curve is situated 
represents the average; and the most probable value of u coincides with the 
average. If it is known that a given distribution is of the normal type, it is 
permitted to treat the mean value of a number of measurements as the most 
probable value. The same result holds for distributions of a different 
character, provided they are symmetrical with respect to their peak value; but 
the theorem does not hold for all distributions, and it is not possible, 
therefore, unless specific reasons are known, to equate the mean value with the 
most probable value. 

We see from (9b) that the dispersion, or standard deviation, of the normal 
curve is connected by a simple relation to the measure of precision. Since 
the normal distribution contains only the two parameters u_0 and h, it is 
completely determined by the average and the dispersion. The normal curve 
drawn in the diagram of the recruits in § 42, figure 12 (p. 212), was calculated 
by these relations. Its peak was drawn at the place of the average u_0 = 67.701 
in., and the parameter h = 0.274 was calculated from the statistical dispersion 
Δ(u) = 2.5848 by means of (9b). The numbers given in the last column of 
table 3 (p. 213) supply a test, which shows that the observed distribution 
is well represented by the normal curve thus computed. Usually, whenever 



Fig. 16. Standard deviation for two different normal curves (h = 2, Δ(u) = 0.35; h = 4, Δ(u) = 0.175). 


we have good reason to assume that the distribution is of the normal type, 
it will be unnecessary to draw the step curve and to make the test; the normal 
curve is then calculated directly from the values M(u) and Δ(u) of the 
statistics. 
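
As a brief illustration of how (9b) is used in this direction, the following sketch solves it for h and applies it to the dispersion quoted for the recruits. The function names are illustrative and not part of the text.

```python
import math


def precision_from_dispersion(delta):
    """Solve (9b) for h: h = 1 / (Delta * sqrt(2))."""
    return 1.0 / (delta * math.sqrt(2.0))


def dispersion_from_precision(h):
    """Relation (9b): Delta = 1 / (h * sqrt(2))."""
    return 1.0 / (h * math.sqrt(2.0))


# Statistical dispersion of the recruits' heights, as quoted in the text.
h_recruits = precision_from_dispersion(2.5848)
print(round(h_recruits, 3))

# Dispersions belonging to the two measures of precision shown in figure 16.
print(dispersion_from_precision(2.0), dispersion_from_precision(4.0))
```

Rounded to three decimals the first value reproduces the h = 0.274 used for the curve of figure 12.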

It is of interest to calculate for the normal distribution the probability w_Δ 
belonging to the linear dispersion, that is, the probability that a value u^i 
deviates from the average only within the limits ± Δ(u), given by the 
standard deviation. In Tchebychev’s inequality (18, § 37) we found a more general 
probability of that kind; but the inequality merely demonstrates the 
trivial fact that w_Δ must be ≥ 0, so that in the case of distributions of a 
general type no definite statement about w_Δ can be made. But for normal 
distributions w_Δ can be determined. If the average is assumed to be situated 
at the zero point (a simplification that does not change the result), the 
probability obtains the form, with the substitution x = h · u, 

$$w_\Delta = \frac{h}{\sqrt{\pi}} \int_{-\frac{1}{h\sqrt{2}}}^{+\frac{1}{h\sqrt{2}}} e^{-h^2 u^2}\, du = \frac{1}{\sqrt{\pi}} \int_{-\frac{1}{\sqrt{2}}}^{+\frac{1}{\sqrt{2}}} e^{-x^2}\, dx = 0.68269$$

The last integration is carried out by means of a table of the Gauss function. 
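
In place of a table, the same integral can be evaluated with the error function. This is a minimal sketch, assuming only the formula for w_Δ and relation (9b); the function name is illustrative.

```python
import math


def prob_within_dispersion(h):
    """Probability that a normally distributed u lies within +/- Delta(u) of the mean.

    The integral of h/sqrt(pi) * exp(-h^2 u^2) from -Delta to +Delta equals
    erf(h * Delta), and Delta = 1 / (h * sqrt(2)) by (9b).
    """
    delta = 1.0 / (h * math.sqrt(2.0))
    return math.erf(h * delta)


for h in (0.274, 2.0, 4.0):
    print(h, prob_within_dispersion(h))
```

Since h cancels inside the error function, every measure of precision gives the same value, roughly two-thirds.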

It is particularly important that the probability w_Δ is independent of the 
measure of precision h. If the Gaussian character of the distribution under 
consideration is known, this result makes it possible to interpret the 
dispersion found statistically as follows. We may expect with the probability ⅔ 
that an amount u_i deviates from the average only within the limits ± Δ(u). 
This relation justifies the usual procedure of stating the result of a series of 
measurements in the form of the mean value by adding the limits ± Δ(u). 
In dealing with normal distributions, it is always the same probability, that 
is, approximately the probability ⅔, with which the result may be regarded 
as lying within these limits of error. 

The dispersion, or standard deviation, Δ(u) for two different normal curves 
is presented in figure 16. The shaded areas represent the probability w_Δ; 
these areas, for both curves, are equal to about ⅔ of the total area. For the 
steeper curve the shaded strip must, therefore, be narrower. 

§ 44. Many-Dimensional Attribute Spaces 

For many-dimensional attribute spaces the concepts of average and 
dispersion can be defined in a generalized meaning. The resulting concepts are 
mathematically analogous to certain concepts of mechanics: the probability 
function may be compared to the mass density, the average to the center of 
gravity, the dispersion to the momentum of inertia. Therefore the average 
is frequently called the momentum of the first order, the dispersion the 
momentum of the second order. This analogy makes it clear that it is possible 
to introduce momenta of higher order into the probability calculus, 
corresponding to those of mechanics (this holds also, of course, for the 
one-dimensional attribute space); and the interesting theorem has been derived that a 
distribution is determined by the totality of its infinitely many momenta. 
But these problems will not be discussed, since they belong in the purely 
mathematical parts of the probability calculus. 

However, the many-dimensional attribute spaces will be considered from 
another point of view. We can regard the different dimensions of the attribute 
space as representing different attributes, combinations of which are 
coordinated to the elements x_i of the sequence; and thus every attribute pair, 
attribute triplet, and so on may be regarded from the standpoint of the 
probability of combinations. Call B_m the case for which u lies within an interval 
from u_m to u_{m+1}, and C_l the case for which v lies within the interval from 
v_l to v_{l+1}; then the probability that the attribute point falls into the rectangle 
u_m to u_{m+1}, v_l to v_{l+1} is given by the expression 

$$\int_{u_m}^{u_{m+1}} \int_{v_l}^{v_{l+1}} \varphi(u,v)\, du\, dv \tag{1}$$

The probability function of many variables may therefore be regarded as the 
density of a probability of combinations. In order to arrive at the probability 
of the attribute B_m, taken alone, we construct the always true disjunction 

$$(C_l \lor \bar{C}_l \equiv C_0 \lor C_1 \lor C_{-1} \lor C_2 \lor C_{-2} \lor \ldots) \tag{2}$$

and then apply the theorem of addition, which means the same as extending 
the integration over the whole domain of the variables. Thus we have 

$$P(A,B_m) = P(A,B_m.[C_0 \lor C_1 \lor C_{-1} \lor C_2 \lor C_{-2} \lor \ldots]) = \int_{u_m}^{u_{m+1}} \int_{-\infty}^{+\infty} \varphi(u,v)\, du\, dv \tag{3}$$

We put 

$$\varphi_1(u) = \int_{-\infty}^{+\infty} \varphi(u,v)\, dv \tag{4}$$

The one-place function φ_1(u) is the probability function that controls u 
taken alone, that is, it measures the probability of a u-value irrespective of 
the values v. Analogously, the one-place function 

$$\varphi_2(v) = \int_{-\infty}^{+\infty} \varphi(u,v)\, du \tag{5}$$

is the probability function controlling v taken alone. Therefore, we have 

$$P(A,B_m) = \int_{u_m}^{u_{m+1}} \varphi_1(u)\, du \qquad P(A,C_l) = \int_{v_l}^{v_{l+1}} \varphi_2(v)\, dv \tag{6}$$

Equations (4) and (5) represent a continuous generalization of the rule of 
elimination. 1 In conjunction with (6) and (1) they permit us to derive the 
relative probabilities P(A.B_m,C_l) and P(A.C_l,B_m) when we employ the 
general theorem of multiplication: 

$$P(A.B_m,C_l) = \frac{\int_{u_m}^{u_{m+1}} \int_{v_l}^{v_{l+1}} \varphi(u,v)\, du\, dv}{\int_{u_m}^{u_{m+1}} \varphi_1(u)\, du} \tag{7a}$$


1 For a form corresponding more closely to (21, § 19) see (7, § 45). 
$$P(A.C_l,B_m) = \frac{\int_{u_m}^{u_{m+1}} \int_{v_l}^{v_{l+1}} \varphi(u,v)\, du\, dv}{\int_{v_l}^{v_{l+1}} \varphi_2(v)\, dv} \tag{7b}$$


The relative probability appears, as in (2, § 41), as the quotient of two 
integrals. The special case in which the two dimensions u and v are mutually 
independent is characterized by the equations 

$$P(A.B_m,C_l) = P(A,C_l) \qquad P(A.C_l,B_m) = P(A,B_m) \tag{8}$$

which lead, with (7) and (6), to the relation 

$$\int_{u_m}^{u_{m+1}} \int_{v_l}^{v_{l+1}} \varphi(u,v)\, du\, dv = \int_{u_m}^{u_{m+1}} \varphi_1(u)\, du \cdot \int_{v_l}^{v_{l+1}} \varphi_2(v)\, dv \tag{9}$$

If these equations are to be satisfied for any choice of the coordinates u_m 
and v_l and of the magnitude of the intervals u_m − u_{m+1}, v_l − v_{l+1}, the relation 

$$\varphi(u,v) = \varphi_1(u) \cdot \varphi_2(v) \tag{10}$$

must hold. For the attribute space of independent dimensions the probability 
function splits up into a product of one-place probability functions. The 
function φ(u,v), when considered as a surface in a three-dimensional space, 
thus represents a surface of the following properties: the two sets of planes 
u = const. and v = const. intersect with this surface in such a way that any 
two intersection figures of the same set result from each other by a 
multiplication of their ordinates by a factor. For two planes of the same set this 
factor is the ratio of the values that the corresponding one-place function 
assumes at the two planes, for instance φ_2(v_l)/φ_2(v_l') if v = v_l = const. 
and v = v_l' = const. represent the intersecting planes; 
a corresponding relation holds for the other set. The process of expansion (or 
shrinking) of a figure in only one dimension is usually called a dilatation; 
we may say, therefore, that any two intersection figures of the same set 
result from each other by a dilatation. Thus we call a surface φ(u,v) of the 
property (10) a dilatation surface. 2 
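
For a distribution given by interval probabilities rather than by a density, the rule of elimination (4) and (5), the relative probability (7a), and the product condition (10) become simple operations on a table. The sketch below is a discretized stand-in for the integrals; the two 2 × 2 tables are hypothetical numbers chosen for illustration.

```python
def marginals(joint):
    """Discrete rule of elimination.

    joint[m][l] is the probability that u lies in interval m and v in interval l.
    Returns the one-place probabilities for u and for v.
    """
    p_u = [sum(row) for row in joint]
    p_v = [sum(col) for col in zip(*joint)]
    return p_u, p_v


def relative_probability(joint, m, l):
    """Cell version of (7a): P(A.B_m, C_l) = P(A, B_m.C_l) / P(A, B_m)."""
    p_u, _ = marginals(joint)
    return joint[m][l] / p_u[m]


def is_product(joint, tol=1e-12):
    """Check the product condition (10) cell by cell."""
    p_u, p_v = marginals(joint)
    return all(
        abs(joint[m][l] - p_u[m] * p_v[l]) <= tol
        for m in range(len(p_u))
        for l in range(len(p_v))
    )


# Hypothetical tables: the first is the product of (0.2, 0.8) and (0.3, 0.7).
independent = [[0.06, 0.14], [0.24, 0.56]]
dependent = [[0.30, 0.10], [0.10, 0.50]]

print(is_product(independent), is_product(dependent))
print(relative_probability(independent, 0, 0), relative_probability(dependent, 0, 0))
```

In the independent table the relative probability of C_0 equals its one-place probability 0.3 whatever row is chosen, which is the content of (8); in the dependent table it changes with the row.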

An example of a dilatation surface is presented by the Gauss bell erected 
over the target, as shown in figure 10 (p. 206); its vertical plane sections are 

2 Since the splitting (10) results only for a certain choice of the coordinates, the surface of 
dilatation must be defined more precisely by the condition that there exist rectangular 
coordinates for which the equation of the surface splits up according to (10); or that there 
exist two sets of planes at right angles to each other, intersecting with the surface in such a 
way that any two intersection figures of the same set result from each other by the operation 
of dilatation. 
normal curves of different scales. Analytically, this property follows from the 
relation 

$$e^{-h^2(u^2 + v^2)} = e^{-h^2 u^2} \cdot e^{-h^2 v^2} \tag{11}$$


However, dilatation surfaces are not necessarily rotation surfaces; and, vice 
versa, rotation surfaces are not always dilatation surfaces. 

These considerations may be extended with respect to attribute spaces of 
several dimensions. The coordinates of an n-dimensional attribute space are 
denoted by u^(1) . . . u^(n); the parentheses around the superscript are to 
indicate that this superscript stands for the dimension of the attribute space and 
not for the individual value of u coordinated to the n-th element (see the 
notation introduced in § 35). This role is then assigned, as in § 39, to a second 
superscript. When we write the second superscript, the parentheses around 
the first superscript are no longer required; for instance, u^{ki} would denote the 
attribute value of the i-th element of the probability sequence in respect to 
the k-th dimension of the attribute space. The rule of elimination assumes 
the form of an (n − 1)-fold integration: 

$$\varphi_1(u^{(1)}) = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} \varphi(u^{(1)} \ldots u^{(n)})\, du^{(2)} \ldots du^{(n)} \tag{12}$$

The other one-place probability functions, φ_2(u^(2)) and so on, are introduced 
in a corresponding manner. The relative probabilities can be defined in analogy 
to (7); this definition can be given in different ways, depending on how many 
dimensions are used in the first term, that is, the reference class. But we 
consider here only the case of independent dimensions, which is characterized 
by the relation 

$$\varphi(u^{(1)} \ldots u^{(n)}) = \varphi_1(u^{(1)}) \cdots \varphi_n(u^{(n)}) \tag{13}$$

Consequently, the function φ(u^(1) . . . u^(n)) represents a dilatation surface 
in an (n + 1)-dimensional space. 

The attribute space of several dimensions can be applied to the treatment 
of event combinations. For this purpose the attribute spaces of separate 
sequences are combined in a single higher-dimensional attribute space. For 
instance, the physical state x_i of the wind as observed daily is represented 
in a two-dimensional attribute space u^(1), u^(2), when we characterize the wind 
by its intensity and direction; the daily temperature y_i and the rainfall z_i 
may each be characterized, respectively, by the one-dimensional attribute u^(3) 
or u^(4). The data may be combined so that we can interpret them as an 
attribute point in a four-dimensional space u^(1) . . . u^(4); this attribute point 
is coordinated to the triplet x_i, y_i, z_i of events. By regarding such 
combinations of events as one event, we can extend the considerations developed for 
the attribute space of many dimensions to the treatment of the combination 
of different sequences. 
For the sake of simplicity, assume that the individual sequence has a 
one-dimensional attribute space; the n sequences then determine an n-dimensional 
attribute space. If the sequences are completely independent of each other 
(§ 23), the probability function of the n-dimensional attribute space can be 
factorized according to (13) as the product of the one-place probability 
functions. Such a case may be discussed in a manner similar to the treatment of 
steps of amounts, as explained in § 38. 

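Beginning with the derivation of the results for average and dispersion of 
additive amounts in two attribute dimensions, analogous to (3, § 38), we 
obtain 

$$M(u+v) = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} (u+v) \cdot \varphi(u,v)\, du\, dv$$

$$= \int_{-\infty}^{+\infty} u \cdot \Big[\int_{-\infty}^{+\infty} \varphi(u,v)\, dv\Big] du + \int_{-\infty}^{+\infty} v \cdot \Big[\int_{-\infty}^{+\infty} \varphi(u,v)\, du\Big] dv$$

$$= \int_{-\infty}^{+\infty} u \cdot \varphi_1(u)\, du + \int_{-\infty}^{+\infty} v \cdot \varphi_2(v)\, dv = M(u) + M(v) \tag{14}$$

It is not necessary to presuppose the independence of the dimensions. The 
corresponding result for the dispersion, however, is valid, as before, only for 
independent attribute dimensions, in analogy to (8, § 38). We obtain, with 
the help of (14), 

$$\delta(u+v) = u + v - M(u+v) = u - M(u) + v - M(v) = \delta u + \delta v$$

$$\Delta^2(u+v) = M(\delta^2(u+v)) = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} (\delta^2 u + 2\,\delta u\,\delta v + \delta^2 v) \cdot \varphi(u,v)\, du\, dv \tag{15}$$

Using (10) we derive 

$$\Delta^2(u+v) = \int_{-\infty}^{+\infty} \delta^2 u \cdot \varphi_1(u)\, du \cdot \int_{-\infty}^{+\infty} \varphi_2(v)\, dv + 2 \int_{-\infty}^{+\infty} \delta u \cdot \varphi_1(u)\, du \int_{-\infty}^{+\infty} \delta v \cdot \varphi_2(v)\, dv + \int_{-\infty}^{+\infty} \delta^2 v \cdot \varphi_2(v)\, dv \cdot \int_{-\infty}^{+\infty} \varphi_1(u)\, du$$

$$= \Delta^2(u) + \Delta^2(v) \tag{16}$$

since the middle term vanishes because of (5, § 43). 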
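
The difference between (14), which needs no independence, and (16), which does, can be seen on a small example. The sketch below replaces the integrals by sums over a hypothetical two-valued attribute in each dimension; the tables and attribute values are illustrative assumptions.

```python
def mean(joint, us, vs, f):
    """Sum of f(u, v) weighted by the joint probabilities joint[m][l]."""
    return sum(
        joint[m][l] * f(us[m], vs[l])
        for m in range(len(us))
        for l in range(len(vs))
    )


def dispersion_sq(joint, us, vs, f):
    """Squared dispersion of f(u, v): mean of the squared deviation from the mean."""
    mu = mean(joint, us, vs, f)
    return mean(joint, us, vs, lambda u, v: (f(u, v) - mu) ** 2)


us = [0.0, 1.0]   # hypothetical attribute values of the first dimension
vs = [0.0, 2.0]   # hypothetical attribute values of the second dimension
independent = [[0.06, 0.14], [0.24, 0.56]]   # product of (0.2, 0.8) and (0.3, 0.7)
dependent = [[0.30, 0.10], [0.10, 0.50]]

results = {}
for name, joint in (("independent", independent), ("dependent", dependent)):
    m_sum = mean(joint, us, vs, lambda u, v: u + v)
    m_parts = mean(joint, us, vs, lambda u, v: u) + mean(joint, us, vs, lambda u, v: v)
    d_sum = dispersion_sq(joint, us, vs, lambda u, v: u + v)
    d_parts = (dispersion_sq(joint, us, vs, lambda u, v: u)
               + dispersion_sq(joint, us, vs, lambda u, v: v))
    results[name] = (m_sum, m_parts, d_sum, d_parts)
    print(name, m_sum, m_parts, d_sum, d_parts)
```

In both tables the averages add; the squared dispersions add only in the independent table, because in the dependent one the middle term of (15) does not vanish.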

The extension of these formulas to n attribute dimensions need not be 
explained, since it is identical with the considerations of § 39. We can take 
over all the formulas of § 39. There will be no change in the formulation of 
the derivations, since we have always employed only the relations (3 and 8a, 
§ 38) and (13 and 15, § 37), which are written in a symbolic notation and 
which we have now derived once more in (14) and (16) and in (7, § 43). The 
summations occurring in the formulas of § 39 remain summations for the 
continuous attribute space, since the sums refer to the individual dimensions. 

From the viewpoint of the continuous distribution another problem may 
be considered. We can inquire after the distribution 

$$\varphi(u) \qquad u = \frac{1}{n} \cdot \sum_{k=1}^{n} u^{(k)} \tag{17}$$

that is, after the probability function that controls the mean value of the 
n one-dimensional attributes u^(k); u then corresponds to the quantity that 
we would denote by in the notation of § 39. It can be shown that the 
function φ(u) converges with increasing n toward a normal distribution if 
the attribute dimensions are completely independent of each other and if 
certain other properties are fulfilled. The proof cannot be derived without 
a mathematical apparatus more elaborate than falls within the scope of this 
book; the mathematical literature may be consulted for this purpose. 3 For 
practical applications the theorem is of great importance, because it explains 
the fact that the normal distribution finds many applications. The 
fluctuations of observable quantities can frequently be reduced to additive 
fluctuations of a great number of elementary quantities; in all such instances the 
observable result corresponds with good approximation to the normal 
distribution. An example is the theory of observational errors, since the error of 
each single measurement is to be conceived as the superposition of numerous 
elementary errors; or the kinetic theory of matter, in which a normal 
distribution is derived for the velocities of the molecules (the Maxwell distribution). 
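
The convergence stated here can be made plausible by a small simulation, which of course is no substitute for the proof. The sketch below assumes independent attributes that are each uniformly distributed between 0 and 1 (a deliberately non-normal choice) and uses the fraction of values within ± Δ(u) of the average as a rough indicator of normal character; the sample sizes and the seed are arbitrary.

```python
import math
import random


def sample_means(n, trials, rng):
    """Draw `trials` values of the mean of n independent uniform attributes."""
    return [sum(rng.random() for _ in range(n)) / n for _ in range(trials)]


def fraction_within_one_dispersion(values):
    """Share of values deviating from their average by at most the dispersion."""
    m = sum(values) / len(values)
    delta = math.sqrt(sum((x - m) ** 2 for x in values) / len(values))
    return sum(abs(x - m) <= delta for x in values) / len(values)


rng = random.Random(0)  # fixed seed so that the run is repeatable
frac_single = fraction_within_one_dispersion(sample_means(1, 20_000, rng))
frac_mean = fraction_within_one_dispersion(sample_means(30, 20_000, rng))
print(frac_single, frac_mean)
```

For a single uniform attribute the fraction lies near 0.58; for the mean of thirty such attributes it lies near the value of about 0.68 that characterizes the normal distribution.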


§ 45. Relative Probability Functions 


In the attribute space of independent dimensions the relative probability 
(7, § 44) becomes identical with the one-place probability (6, § 44) because 
of (8, § 44); but in the attribute space of dependent dimensions the relative 
probability for intervals is to be expressed by the quotient of two integrals 
(7, § 44). For this case, too, we can derive a few general properties of the 
relative probability. If in (7a, § 44) the value u_m is made to approach u_{m+1}, 
while v_l and v_{l+1} remain fixed, the quotient does not go toward zero but to 
the expression 

$$\int_{v_l}^{v_{l+1}} \frac{\varphi(u_m,v)}{\varphi_1(u_m)}\, dv \tag{1}$$


3 See, for instance, Richard von Mises, Vorlesungen aus dem Gebiete der angewandten 
Mathematik, Vol. I: Wahrscheinlichkeitsrechnung (Leipzig, 1931), p. 216. 
The corresponding statement holds for (7b, § 44). When we put 

$$\psi_1(u;v) = \frac{\varphi(u,v)}{\varphi_1(u)} \qquad \psi_2(v;u) = \frac{\varphi(u,v)}{\varphi_2(v)} \tag{2}$$


the two functions ψ_1(u;v) and ψ_2(v;u) represent one-dimensional probability 
densities; integrated over the second variable they lead to probabilities. I shall 
call them relative probability functions. Integrated over v, the function ψ_1(u;v) 
supplies the probability that v lies within a certain interval if u possesses a 
precisely given value; and, correspondingly, integrated over u, the function 
ψ_2(v;u) supplies the probability that u lies within a certain interval if v has 
a precisely given value. These probabilities are not zero, since the precise 
value occurs in the first term, or reference class. For instance, if a certain 
precise value of u is given, the probability that v assumes any value within 
the whole domain between − ∞ and + ∞ is = 1. An integration over the 
first variable is not permissible, that is, it has no meaning in the theory of 
probability. 1 In order to indicate that the two variables are of a different kind, 
a semicolon is placed between the variables. If we wish to obtain the relative 
probability for an expression with a finite interval in the first term, we must 
not integrate over the first variable but have to go back to the integral 
quotient (7a, § 44) or (7b, § 44). The reason is that the probability of an “or” 
in the first term, according to (4, § 22), does not represent an addition, but 
indicates the formation of a mean value; (7a, § 44) and (7b, § 44) are analogous 
to (4, § 22). However, for small intervals ε (the interval v_l to v_{l+1} may be 
large) we have the approximate equality 


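$$\frac{\int_{u}^{u+\epsilon} \int_{v_l}^{v_{l+1}} \varphi(u,v)\, du\, dv}{\int_{u}^{u+\epsilon} \varphi_1(u)\, du} \approx \int_{v_l}^{v_{l+1}} \psi_1(u;v)\, dv \tag{3}$$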


so that it is possible, in practice, for small intervals in the first term, to 
construct the relative probability functions (2) by methods of enumeration 
similar to those used for absolute probability functions. 

As examples of relative probability functions, figures 17 and 18 present 
certain transformations of the diagram of the Hertzsprung-Russell statistics. 
In figure 17 the function $\varphi(u,v)$ was integrated in the v-direction according
to (4, § 44), and thus the function $\psi_1(u;v)$ was constructed by the help of (2).
In figure 18 the function $\varphi(u,v)$ was integrated in the u-direction according
to (5, § 44), and then the function $\psi_2(v;u)$ was constructed according to (2).
The calculation was carried out as follows: the diagram of figure 14 (p. 216) 
was covered by the net of squares ds, and the numbers of points per square 


1 See, however, formula (7). 
Fig. 17. Hertzsprung-Russell diagram redrawn as relative probability function: integration
of fig. 15 in vertical direction; function $\psi_1$ of (2) represented by contour lines.
Fig. 18. Hertzsprung-Rus
of fig. 15 in horizontal
were added for each horizontal row and vertical column, respectively; then
the number of points in each square was divided by the amount of its vertical
column or horizontal row. Thus the contour lines in figures 17 and 18 were
drawn. The numbers indicated for the contour lines correspond to a scale
ds = 1. Thus they cannot be compared directly to the corresponding numbers
of figure 15 (p. 217); a scale like that of figure 15 cannot be employed
for these diagrams, since the denominators of the fractions obtained are
different for each column or row. The following examples, which can be read
directly from the diagrams, may explain the meaning of the diagrams: for a
star of spectral type Mb, the probability that its absolute brightness lies
between +9 and +10 has the value 0.05 (fig. 17); for a star of the absolute
brightness +10, the probability that its temperature lies between 3250° K
and 2750° K has the value 0.5 (fig. 18).
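
The enumeration described here can be sketched in a few lines. This is only an illustration under assumed inputs: the grid `counts` is hypothetical and is not taken from the Hertzsprung-Russell statistics, and the function name is introduced here for convenience. Rows stand for intervals of v, columns for intervals of u, with ds = 1.

```python
def relative_probability_tables(counts):
    """counts[r][c] = number of points in the square at row r (v-interval), column c (u-interval)."""
    n_rows, n_cols = len(counts), len(counts[0])
    col_sums = [sum(counts[r][c] for r in range(n_rows)) for c in range(n_cols)]
    row_sums = [sum(row) for row in counts]
    # psi1(u;v): each square divided by the total of its vertical column (u held fixed)
    psi1 = [[counts[r][c] / col_sums[c] if col_sums[c] else 0.0
             for c in range(n_cols)] for r in range(n_rows)]
    # psi2(v;u): each square divided by the total of its horizontal row (v held fixed)
    psi2 = [[counts[r][c] / row_sums[r] if row_sums[r] else 0.0
             for c in range(n_cols)] for r in range(n_rows)]
    return psi1, psi2


# Hypothetical counts of points per square (not real star data).
counts = [[2, 0, 1],
          [6, 3, 1],
          [2, 9, 8]]
psi1, psi2 = relative_probability_tables(counts)
```

Each column of `psi1` and each row of `psi2` then sums to 1, which is why the two tables cannot share a single scale with the undivided counts.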

Both diagrams exhibit contour lines similar in shape to those of figure 15, 
but some characteristic differences are noticeable. In figure 18 the ridge 
running from the upper left to the lower right corner is ascending, but it 
descends in figure 15. Furthermore, in figure 18 a new crest appears, meaning
that, although not many stars have a magnitude as large as -2 to -4,
there is a high relative probability of a spectral type lying close to G0 if such
a star occurs at all. The significance of the diagrams consists in that they
show more strongly the relation between brightness and temperature, eliminating
from this relation certain accidental features and even systematic
errors. For instance, in figure 15 the close relation between brightness and
the spectral type G0, which is shown in figure 18 by the new crest mentioned,
is suppressed by the accidental feature that few of these bright stars exist.
A systematic misrepresentation, however, is produced in figures 14 and 15 
in the following manner. Since it is more difficult to find faint stars than to 
find bright ones, relatively few stars will appear in the lower part of the 
diagram. Or, more precisely speaking, since a bright star can be seen for 
great distances, where a dark star can no longer be seen, the bright stars as 
counted will fill a greater part of the cosmic space than the dark ones. The 
original diagrams of figures 14 and 15 thus are richer in bright stars, so 
that for larger values of the ordinate v (that is, the upper part of the diagram) 
the probability mountains become too high. Figure 18 is free from this error, 
as is shown by the upward slope of the ridge extending from the upper left 
to the lower right corner of the diagram, whereas the downward slope of the
ridge in figure 15 must be regarded as an effect of the systematic
misrepresentation mentioned.

By means of relative probability functions it is possible to construct continuous
extensions of the schema of figure 5 (p. 82) and of the theorems
referring to it. Assume that the disjunction $B_1 \lor \ldots \lor B_r$ is replaced by
intervals du of a continuous variable u. We then replace $B_k$ by the expression
$B_{u,du}$, meaning “a value in the interval from u to u + du”, and thus put

$$P(A,B_k) = P(A,B_{u,du}) = \varphi(u)\,du \tag{4}$$

For a precise definition of these symbols, we would first give the definition 
for finite intervals du and then proceed to the limit du = 0. Our way of 
writing means that the expression (4) leads to correct results if transitions 
to the limit du = 0 are derived. 

Assume further that a second quantity v is related to u by a relative probability
function $\psi_1(u;v)$, so that we have

$$P(A.B_k,C) = P(A.B_{u,du},C_{v,dv}) = P(A.B_u,C_{v,dv}) = \psi_1(u;v)\,dv \tag{5}$$

The transition from the second to the third term is possible because a relative
probability has a precise value in the first variable; for small intervals du
any value u of the interval may be chosen, according to (3). Generally speaking,
a term $B_{u,du}$ can be replaced by $B_u$ if it stands for the reference class of
probability expressions; but if it stands for the attribute class, the corresponding
probability function is to be multiplied by du. Like (4), formula (5)
is correct for transitions to the limit du = 0, dv = 0.

Finally, we put

$$P(A,C) = P(A,C_{v,dv}) = \chi(v)\,dv \tag{6}$$

This formulation is subject to the same qualifications as the preceding ones.
We can now write the continuous form of the rule of elimination (21, § 19):

$$\chi(v) = \int_{-\infty}^{+\infty} \varphi(u) \cdot \psi_1(u;v)\,du \tag{7}$$

The value dv is canceled on the two sides, and thus (7) presents the probability
density $\chi(v)$ as a function of the two probability densities $\varphi(u)$ and $\psi_1(u;v)$.
Formula (7) shows that a relative probability function can be meaningfully
integrated over the first variable if it is multiplied by another probability
function.
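
A minimal numerical sketch of (7) may make the integration over the first variable concrete. The two densities below are assumptions chosen only for illustration (a normal density for u, and a normal density for v centered at 0.5u); the names `phi`, `psi1`, and `chi` are introduced here and the integral is approximated by a midpoint sum.

```python
import math


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def phi(u):
    # assumed probability density of u
    return normal_pdf(u, 0.0, 1.0)


def psi1(u, v):
    # assumed relative probability function: density of v for a precise value of u
    return normal_pdf(v, 0.5 * u, 1.0)


def chi(v, lo=-10.0, hi=10.0, steps=4000):
    # rule of elimination (7): integrate phi(u) * psi1(u; v) over the first variable u
    du = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        u = lo + (k + 0.5) * du
        total += phi(u) * psi1(u, v)
    return total * du


density_at_v = chi(0.7)
```

For these assumed densities the result can be compared with the known closed form, a normal density for v with variance 1 + 0.25.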

For an illustration, let u be a man’s height, whereas v is his weight. Height
and weight are connected by the relative probability function $\psi_1(u;v)$. This
means that the height of a man does not determine his weight; but if his
height u is known, we know with a certain probability that his weight is v
within the interval dv. Note that this probability is practically the same
whether his height is given as a precise value or within a certain small interval
du; but the interval dv, of course, has an influence on the numerical value of
the probability. The distributions $\varphi(u)$ of the height and $\chi(v)$ of the weight
are connected by the relation (7).
The continuous form of the rule of Bayes (10, § 21) results as follows:

$$\psi_2(v;u) = \frac{\varphi(u) \cdot \psi_1(u;v)}{\int_{-\infty}^{+\infty} \varphi(u) \cdot \psi_1(u;v)\,du} \tag{8}$$

The value dv, by which the expressions in numerator and denominator are
multiplied if they are to represent probabilities, drops out. Like (7), formula
(8) is thus a relation between probability densities. In the illustration used,
$\psi_2(v;u)$ is the probability density that a man of weight v has the height u.
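
Formula (8) can be checked numerically in the same spirit. The sketch below assumes, purely for illustration, a normal density for u and a normal relative density for v centered at 0.5u; these choices and all function names are introduced here, not taken from the text.

```python
import math


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def phi(u):
    # assumed antecedent density of u
    return normal_pdf(u, 0.0, 1.0)


def psi1(u, v):
    # assumed relative probability function of v for a precise value of u
    return normal_pdf(v, 0.5 * u, 1.0)


def integrate(f, lo, hi, steps=4000):
    # midpoint rule
    h = (hi - lo) / steps
    return sum(f(lo + (k + 0.5) * h) for k in range(steps)) * h


def psi2(v, u):
    # rule of Bayes (8): numerator phi * psi1, denominator the integral over u
    denominator = integrate(lambda x: phi(x) * psi1(x, v), -10.0, 10.0)
    return phi(u) * psi1(u, v) / denominator


# Integrated over its second variable u, psi2 yields the probability 1.
total = integrate(lambda u: psi2(1.0, u), -10.0, 10.0, steps=400)
```

Under these assumptions the inverse density is again normal in u, with mean 0.4v and variance 0.8, which gives a closed form to compare against.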

Let $\psi_1'(u_1,u_2;v)$ be the probability density that, if u is in the interval from
$u_1$ to $u_2$, the other quantity has the value v. This probability is determined
by the continuous form of the special rule of reduction (4, § 22):

$$\psi_1'(u_1,u_2;v) = \frac{\int_{u_1}^{u_2} \varphi(u) \cdot \psi_1(u;v)\,du}{\int_{u_1}^{u_2} \varphi(u)\,du} \tag{9}$$

In the illustration, $\psi_1'(u_1,u_2;v)$ is the probability density that a man whose
height is between $u_1$ and $u_2$ has the weight v. Like (7) and (8), formula (9)
is a relation between probability densities.

Formula (9) shows that the transition from $\psi_1(u;v)$ to a finite interval in
the reference class leads to a somewhat involved expression, which requires
knowledge of the antecedent probability density $\varphi(u)$. The transition from
$\psi_1(u;v)$ to a finite interval in the attribute class, however, is achieved by a
simple integration:

$$\psi_1''(u;v_1,v_2) = \int_{v_1}^{v_2} \psi_1(u;v)\,dv \tag{10}$$

Formulas (9) and (10) exhibit the intrinsic difference between the two variables
of relative probability functions.
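
The asymmetry between (9) and (10) shows up directly when both are computed. The following sketch rests on assumed densities (normal for u, normal for v centered at 0.5u); the function names are hypothetical and the integrals are midpoint sums.

```python
import math


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def phi(u):
    # assumed antecedent density of u
    return normal_pdf(u, 0.0, 1.0)


def psi1(u, v):
    # assumed relative probability function of v for a precise value of u
    return normal_pdf(v, 0.5 * u, 1.0)


def integrate(f, lo, hi, steps=2000):
    # midpoint rule
    h = (hi - lo) / steps
    return sum(f(lo + (k + 0.5) * h) for k in range(steps)) * h


def psi1_reference_interval(u1, u2, v):
    # formula (9): a mean value weighted by phi(u), not a plain integral over u
    numerator = integrate(lambda u: phi(u) * psi1(u, v), u1, u2)
    return numerator / integrate(phi, u1, u2)


def psi1_attribute_interval(u, v1, v2):
    # formula (10): a simple integration over the second variable
    return integrate(lambda v: psi1(u, v), v1, v2)


density_given_u_interval = psi1_reference_interval(-1.0, 2.0, 0.3)
probability_of_v_interval = psi1_attribute_interval(0.0, -1.0, 1.0)
```

Note that `psi1_reference_interval` needs `phi`, whereas `psi1_attribute_interval` does not; for a very narrow interval of u the former approaches `psi1` itself, in agreement with (3).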

Relative probabilities have great practical importance, but this fact has
rarely been recognized clearly. They always occur when physical laws are
found or tested by experiment. A physical law presents a relation between
two quantities such that if one quantity has a certain value, the other quantity
also assumes a certain value. But in the application of physical laws it is
never possible to take into consideration all the effective factors. Thus, if the
first quantity has a definite value, the value of the other quantity can be
determined only with a certain probability. If we regard the first quantity
as a variable, all the mathematical functions occurring in the laws of nature
have the character of relative probability functions. The assertion of the
redrawn Russell diagram, “If a star belongs to the spectral type Mb (that is,
if it has a surface temperature of about 3000° K), then the probability is
0.05 that its absolute brightness lies between +9 and +10, and the probability
is 0.3 that its brightness lies between -0.5 and +0.5,” represents
the prototype of a physical law. The fact of the probability character of
physical laws is overlooked because usually the most probable value alone
is taken into account, and the relation is then treated as a logical implication.
Thus we say, If the temperature of a steam boiler is 121° C, the pressure



Fig. 19. Deviation of stellar light in gravitational field of sun, according to 
results of E. Freundlich’s eclipse expedition in Sumatra, 1929. 


amounts to 2 atmospheres. That this assertion does not hold with absolute 
certainty, but merely with a very high probability, becomes clear when we 
refer the second clause to the pressure indication of the manometer, which 
may fluctuate because of inhomogeneous conditions in the boiler. 

The probability character of physical laws becomes obvious whenever an 
exact measurement is made, because the required relations are never strictly 
satisfied, but are disturbed by errors in observation. The rather irregular 
picture obtained by representing the results of measurements in a diagram 
can be understood only when it is interpreted by means of the theory of probability.
For an illustration I refer to the result of Freundlich’s eclipse expedition
in Sumatra,² which was undertaken in order to measure the deviation of

2 E. Freundlich, H. von Klüber, and A. von Brunn, “Ueber die Ablenkung des Lichtes im
Schwerefeld der Sonne,” in Abhandl. d. Berliner Akad., math.-phys. Kl., 1931, p. 35.
a light ray of a star under the influence of gravitation when the ray passes
close to the sun, according to Einstein's theory. In figure 19 the distance of 
the light ray from the sun (that is, the angular distance of the star from the 
sun) is indicated on the abscissa in multiples of the sun radius; the ordinates 
represent the observed deviation of the ray in angular seconds. Each dot of 
the diagram corresponds to an observed star. 

The theory demands that to every distance there corresponds a certain 
deviation of the light, increasing with smaller distance such that the product 
of deviation by distance remains constant. This relation, which specifies a 
hyperbola, is fulfilled qualitatively by the curve as drawn, which collectively 
represents the measurements. The stars, however, show a large dispersion and
the shape of the curve cannot be directly recognized; at some places many
ordinate values lie above each other for the same value of the abscissa. The
picture, strictly speaking, would have to be characterized by a relative probability
function $\psi(u;v)$, which determines for every distance u a number of
values v of the deviation, each with a certain probability. But the number of
measured points does not suffice for the determination of this probability 
function, that is, of the probability mountains to be constructed above the 
plane of the drawing. 

In such cases we usually ask for the simplest curve that can be traced
through the points measured. In the interpretation according to the probability
theory, the procedure means that we are satisfied if we can construct
the ridge line of the probability mountains. The ascertainment of the curve is
achieved by the method of least squares, which is too complicated mathematically
to be dealt with in this book. Since the number of measurements
usually is not large enough for a precise definition of the curve inquired, the
balancing of observational results is carried through with the help of conditions
concerning the shape of the curve, which results from the physical
theory. In the instance considered, the balancing is based on the condition
that the curve must possess the character of a hyperbola. In spite of this 
adaptation to the theory, the procedure supplies a quantitative test of the 
theory. The astronomic measurements of the illustration gave the result that 
the observed deviation is larger than the one demanded by the theory, because 
the theoretical curve would lie somewhat lower than the curve of the diagram. 
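
Although the general method is not developed here, the special case of balancing under the hyperbola condition is short enough to sketch. The example assumes the one-parameter form v = k/u, for which minimizing the sum of squared deviations gives k in closed form; the data points are hypothetical and are not Freundlich's measurements, and the function name is introduced only for this illustration.

```python
def fit_hyperbola(points):
    """Least-squares value of k for the curve v = k / u.

    Setting the derivative of sum((v_i - k/u_i)**2) with respect to k to zero
    gives k = sum(v_i / u_i) / sum(1 / u_i**2).
    """
    numerator = sum(v / u for u, v in points)
    denominator = sum(1.0 / (u * u) for u, _ in points)
    return numerator / denominator


# Hypothetical (distance, deviation) pairs scattered around a hyperbola.
measurements = [(1.5, 1.40), (2.0, 0.80), (2.0, 1.10), (3.0, 0.75),
                (4.0, 0.30), (5.0, 0.45), (7.0, 0.20)]
k_fitted = fit_hyperbola(measurements)

# The fitted constant k can then be set against the constant demanded by a theory.
residuals = [v - k_fitted / u for u, v in measurements]
```

The fit uses the theoretical shape of the curve but leaves its constant free, which is what allows the result to serve as a quantitative test.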

The large dispersion of the measurements in the illustration may surprise
those who believe that the laws of nature should exhibit “absolutely certain
truth”; but the example does not represent an exceptional case. Other experimental
results of great import present a like appearance, since precision
measurements are always carried out close to the limits of experimental exactness.
Freundlich's measurement of the deviation of light, obtained as it was
by the help of precise and well-constructed apparatus, is a good example:
only when we use the concepts and methods of the theory of probability are
we able to uncover all the laws and regularities that are hidden in the observational
material.

§ 46. Continuous Probability Sequences 

The extension of the concept of probability sequence so far considered was 
constructed by means of a transition from discrete to continuous attributes. 
A disjunction of a finite number of terms was replaced by a disjunction of 
infinitely many terms, which could be interpreted as members of a continuous 
manifold, the attribute space. The possibility of this generalization is based 
on the existence of a geometrical interpretation of the axiom system; the 
duality of interpretation provides for an isomorphism between the probability 
interpreted as a frequency and the geometrical relations in the attribute space. 
The transition from the elements of the attribute space to finite domains
entails, with respect to the probability, an infinite summation of infinitesimal
amounts. The transition may therefore be regarded, in correspondence to the
theorem of addition, as a transition to an or-probability; and thus the measure 
of the probability of finite domains of the attribute space assumes the form 
of an integral expression. 

It is possible to introduce another generalization of the concept of probability
sequence for which the transition to continuous variability concerns
not the attribute but the element of the sequence itself. To simplify the exposition,
the sequences of the elements $x_i$ and $y_i$ will be identified, so that we
deal with an internal probability implication (see § 9). In the transition to
be explained, a continuous sequence of the elements $x_t$ takes the place of the
sequence of the discrete elements $x_i$. The subscript may thereby be assumed
to represent the continuous time-variable t, so that to each time point t a
certain element $x_t$ corresponds. The transition may be illustrated by the following
example. If we shoot with a machine gun at a target, the continuous
attribute space (the target) is filled with a discrete number of elements;
each individual hit $x_i$ represents an element of the sequence. But one can also
shoot at the target with a continuous beam of water, ejected, say, from a hose
moving irregularly and thus tracing an irregular zigzag line on the target.
Then the hits $x_t$ form a continuous event-sequence, in which a definite event
is defined for each time t. Of this type is the continuous sequence of events
now to be discussed.

It will be clear from the illustration that this generalization of the concept
of probability sequence is possible only if we start with a sequence that has
a probability aftereffect. If the event $x_t$ is located at a certain point u,v of
the attribute space, then an event $x_{t+dt}$ occurring a short time dt later will
still lie close to the attribute point u,v, since the event-sequence is continuous.
This means, however, that there exists a probability aftereffect, manifesting
itself as a tendency to remain. It will therefore be possible to apply to the
sequences considered, in particular, the conditions of a probability transfer,
in which the probability aftereffect depends on the immediate predecessor.
This case will be given when the degree of the probability aftereffect depends
only on the position of the attribute point u,v but not on its velocity. In this
case the continuous probability sequence corresponds to the generalization
of the sequence type classified in § 33 as probability drag. The peculiarity of
continuous probability sequences, in contradistinction to discrete sequences,
consists in the feature that the aftereffect is to be treated by a transition to
infinitesimal distances. The probability of remaining, which was denoted in
the case of probability drag by p + c, will go by infinitesimal steps toward 1,
that is, the probability of the event $x_{t+dt}$ lying within the attribute region
u + du, v + dv approaches 1 if dt goes toward zero. The convergence toward 1
results from the fact that for dt = 0 the events $x_t$ and $x_{t+dt}$ coincide. It would
therefore be impossible to generalize the normal sequence so as to form a
continuous event-sequence; only the sequence with a probability aftereffect
lends itself to such a generalization, the probability of the aftereffect
approaching certainty by infinitesimal steps.

These conditions may be illustrated by a translation into probability properties
of sequences of discrete intervals, represented in figure 20 for the
one-dimensional attribute space u. As abscissa the time t is used; as ordinate,
the attribute u. Let $u^0$ be the position of the attribute point at time $t_0$, and w
the probability that a specified interval is reached from $u^0$. For the interval
a there exists a certain probability w, lying between 0 and 1. If the interval a
is shifted to $a_3$ by way of the positions $a_1$ and $a_2$, w goes toward 1. The same
holds for the interval b if it is shifted by way of $b_1$ and $b_2$ to $a_3$. The convergence
toward 1 does not hold, however, if b is shifted by way of $c_1$ and $c_2$
to $c_3$, since this w goes toward 0; the same applies to w if a is shifted by way
of $d_1$ and $d_2$ to $d_3$. The probability state of a point environment is therefore
represented by sequences the probability of which converges either to 1 or to 0.

The path of the attribute point of a continuous probability sequence is 
represented for the one-dimensional attribute space by a zigzag line. For an 
illustration I refer to the Brownian motion of a rotating mirror, drawn after
a photograph taken by E. Kappler.¹ (See fig. 21.) In the experiment of
Kappler a small mirror, 1-2 sq. mm. in size, was suspended by a quartz fiber

1 Ann. d. Physik., Vol. 11 (1931), p. 242. 


Fig. 20. Probabilities for sequences of intervals in migration space of
continuous probability sequences.
of a length of some centimeters; the thickness of the quartz fiber amounted
to only a few ten-thousandths of a millimeter. The mirror and the suspension 
were placed in a glass vessel, which was evacuated to low pressure. Because 
of the small diameter of the fiber and the weak damping by the air, such an 
arrangement is very sensitive to rotation; when the surrounding air molecules 
impinge upon the mirror, the differences in impact become noticeable and 
the mirror describes an irregular rotation. 

It is characteristic of the Brownian motion, and also of other fluctuation 
phenomena, that the number n of the air molecules participating is not so 



Fig. 21. Brownian motion of rotating mirror drawn after photograph by E. Kappler.
u = angular movement of mirror in minutes of arc (zero point arbitrarily chosen); t = time
in minutes. Reproduced in approximately same size as original photograph. Curve constitutes
example of a continuous probability sequence.


great as is required if the $\sqrt{n}$-law (6c, § 39) is to provide for a virtually complete
uniformity of the average air pressure. The apparatus is so sensitive
that minute fluctuations in the average air pressure can be noticed. For the 
observation a special registering apparatus is employed. A beam of light from 
a lamp is reflected by the mirror and projected by means of a lens system on 
a strip of photographic film; while the light beam moves from side to side, 
the film passes simultaneously from top to bottom, producing the “jerky 
curve” shown in figure 21. However, the curve cannot be subsumed under 
the case of probability transfer because the probability of the aftereffect 
depends not only on the position but also on the velocity of the attribute 
point (because of the inertia of the mirror). Here we have a more general 
case of probability aftereffect. 

Figure 21 indicates how the transition to the probability of finite regions
is carried through. For discrete event-sequences the probability that the
attribute point lies within the interval $u_a$ to $u_b$ is given by the number of
hits falling into this region; for continuous event-sequences the probability
is measured by the sum of the time intervals during which the attribute
point remains in the interval from $u_a$ to $u_b$. In figure 21 we thus determine
the probability by adding the distances marked by a heavy line and dividing
them by the total length of the t-axis. The enumeration of frequencies is
here replaced by the measurement of time intervals. In analogy to the N-symbol
used previously (1a, § 10), we introduce the symbol

$$L_{t=0}^{T}(u \in B) \tag{1a}$$

for the length of time during which the point u remains in the region B of
the attribute space within the time interval from 0 to T. Instead of speaking
of the sequence $x_t$, it is convenient to regard directly the sequence of attributes
$u^t$ coordinated to it, $u^t$ meaning the value of the variable u at the
time t. Instead of regarding B as a class of events, we then conceive B as a
class, or an interval, of numbers (see § 41). The symbol (1a) may be abbreviated,
analogously to the abbreviation (2, § 16), by means of the definition

$$L^T(B) =_{Df} L_{t=0}^{T}(u \in B) \tag{1b}$$

Let A be another region of the attribute space. In analogy to the frequency
interpretation (5, § 16), the probability for continuous event-sequences can
be interpreted as

$$P(A,B) = \lim_{T \to \infty} \frac{L^T(A.B)}{L^T(A)} \tag{2}$$

If the sequence A is compact (as assumed for fig. 21), (2) can be written in
the simpler form

$$P(A,B) = \lim_{T \to \infty} \frac{1}{T}\,L^T(B) \tag{3}$$

Like the geometrical probability (§ 41), this generalization of the concept
of probability sequence depends on the possibility that the axiom system
can be given a geometrical interpretation. Instead of the enumeration of
elements, we have here the measurement of lengths of time; it is a measurement
along the t-dimension of a space that is formed by the attribute dimensions
together with the t-dimension. I call this space the migration space. The
proof that the interpretation given by (2) for the probability concept is admissible
is therefore included in the general proof of the possibility of a geometrical
interpretation.
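
The time-measurement interpretation can be imitated on a sampled path: with equally spaced samples, the length of time spent in a region is proportional to the number of samples falling into it. The sketch below is an assumption-laden illustration; the path generator is a hypothetical stand-in for a curve such as that of figure 21 and does not reproduce Kappler's record.

```python
import random


def time_fraction(samples, lo, hi):
    """Fraction of equally spaced time samples of u that lie in the interval [lo, hi].

    For a finite record of total length T this approximates L^T(B) / T.
    """
    inside = sum(1 for u in samples if lo <= u <= hi)
    return inside / len(samples)


def zigzag_path(n_steps, step=0.1, pull=0.02, seed=0):
    """Hypothetical irregular path whose next value stays close to the previous one."""
    rng = random.Random(seed)
    u, path = 0.0, []
    for _ in range(n_steps):
        u += rng.uniform(-step, step) - pull * u
        path.append(u)
    return path


path = zigzag_path(100_000)
p_interval = time_fraction(path, 0.0, 0.3)
```

Taking longer records corresponds to the limit of increasing T; the finite value is only an estimate of the probability.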

No difficulties arise for the incorporation of continuous probability sequences
in the formal calculus of probability. It is irrelevant for the application
of the axiom system I-IV whether the elements $x_i$, $y_i$ form discrete or
continuous sequences. Continuous probability sequences, which, according to
the definition, include the properties of probability sequences with a continuous
attribute, can be incorporated in the formal calculus of probability
exactly in the same manner as was adopted for primitive probability sequences
(see § 41). How to determine the probability coordinated to the sequences
is not a problem of the formal calculus but of the interpretation. It is only
with respect to the frequency interpretation that continuous probability sequences
differ from discrete sequences: continuous probability sequences provide
a model of the formal probability calculus in which an enumeration is
replaced by a measurement. But all the concepts of the formal calculus are
applicable. For instance, P(A.B,C) means the probability of C within the
subsequence selected by the condition that u lies in B. In the interpretation
given by (2) we have, therefore,

$$P(A.B,C) = \lim_{T \to \infty} \frac{L^T(A.B.C)}{L^T(A.B)} \tag{4}$$

With the existence of probabilities of this kind, the concept of mutual dependence
or independence is applicable to continuous probability sequences.

Only the axioms V (§ 28) require special treatment. The condition that
the sequences contain an infinite number of elements is always fulfilled for
continuous probability sequences. This condition must be strengthened by
the requirement that the length of the sequence along the time axis [for the
classes written in the first term of (4 and 5, § 28)] is infinite. With this addition
the axioms V can be transferred to continuous probability sequences.
In phase probabilities a superscript like a then denotes a time of finite but 
otherwise arbitrary length a. With this qualification the formulas of the 
theory of order are made applicable; thus the concepts of invariant selection, 
lattice formation, and so on can be defined. The only difference is that no 
normal sequences exist, since all sequences have an aftereffect. 

The mathematical treatment of continuous probability sequences will be
explained only for the case where the attribute space is divided into two
regions $B$ and $\bar{B}$; so reference will be made only to alternating sequences.²
Consideration will therefore be restricted to the projection of the curve on
the t-axis in figure 21 (p. 239). The distances in heavy print along the t-axis
correspond to the case where $u \in B$; the other distances along the t-axis correspond
to the case $u \in \bar{B}$. As a further simplification (apart from the infinite
length of the sequence in $B$ and $\bar{B}$), three conditions are introduced:

1. The sequence satisfies the condition of probability transfer according
to § 33, that is, the probability f(t) that u, if it is in $B$ at the time 0, remains
in $B$ during the time interval t does not depend on whether u was in $B$ or $\bar{B}$
before the time 0.³

2. f(t) is a continuous and differentiable function of t.

3. f(0) = 1.

2 The mathematical treatment of the continuous attribute space may be found in the
appendix of the German edition of this book (§ 81), and in H. Reichenbach, “Stetige
Wahrscheinlichkeitsfolgen,” in Zs. f. Physik, Vol. 53 (1929), p. 274.

3 Notice that f(t) is not a probability function, since f(t) represents a probability and not
a probability density.
Corresponding assumptions are made for the probability g(t) that u, if it
is in $\bar{B}$ at the time 0, remains in $\bar{B}$ during the time interval t. The third condition
is required because of the continuity of the probability sequence.

We derive from the third condition, together with the second condition,
if we write the Taylor series only for the first two terms (other terms are not
needed, since we shall later go to the limit dt = 0),

$$f(0 + dt) = f(0) + f'(0)\,dt + \ldots = 1 - \alpha\,dt \qquad \alpha = -f'(0) \tag{5}$$

With the notation of § 33 we obtain, putting $\beta = -g'(0)$,

$$\begin{aligned}
P(A.B,B^{dt}) &= p_1 = 1 - \alpha\,dt \\
P(A.B,\bar{B}^{dt}) &= 1 - p_1 = \alpha\,dt \\
P(A.\bar{B},B^{dt}) &= q_1 = \beta\,dt \\
P(A.\bar{B},\bar{B}^{dt}) &= 1 - q_1 = 1 - \beta\,dt
\end{aligned} \tag{6}$$


$\alpha$ and $\beta$ will be called coefficients of transfer. We can now apply (6, § 33) and
thus have

$$P(A,B) = p = \frac{q_1}{1 - p_1 + q_1} = \frac{\beta}{\alpha + \beta} \tag{7}$$

The existence of the mean probability p according to (3) is thus made sure.
In order to determine the probability f(t), we imagine the t-axis to be divided
into finite small intervals dt; if the interval dt lies completely within $B$, we
regard it as belonging to $B$, otherwise to $\bar{B}$. We then have

$$f(t) = \lim_{\substack{dt \to 0 \\ n\,dt = t}} P(A.B,\,B^{dt}.B^{2dt} \ldots B^{n\,dt}) \tag{8a}$$

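The expression written on the right can be factorized as a product of n factors
of the form $P(A.B,B^{dt})$; this follows from the property of probability
transfer expressed in (9, § 33). Therefore (8a) assumes the form

$$f(t) = \lim_{\substack{dt \to 0 \\ n\,dt = t}} (1 - \alpha\,dt)^n \tag{8b}$$

With $\alpha\,dt = \frac{\alpha t}{n} = \frac{1}{m}$, that is, $n = m\alpha t$, we have

$$f(t) = \lim_{m \to \infty} \left(1 - \frac{1}{m}\right)^{m\alpha t} = \left[\lim_{m \to \infty} \left(1 - \frac{1}{m}\right)^{m}\right]^{\alpha t} = \left[\lim_{m \to \infty} \left(1 + \frac{1}{m}\right)^{m}\right]^{-\alpha t} = e^{-\alpha t} \tag{9}$$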
In the last line the familiar definition of the number e is used. We see that
the probability f(t) of u remaining in $B$ during the time t decreases exponentially.
The form of this function is presented in figure 22 A. In a similar
manner we derive with (1, § 31) for the probability $\bar{f}(t)$ that u, if we have
$u \in B$ at the time 0, changes over to $\bar{B}$ during the following time interval t,

$$\bar{f}(t) = \lim_{\substack{dt \to 0 \\ n\,dt = t}} P(A.B,\,\bar{B}^{dt} \lor \bar{B}^{2dt} \lor \ldots \lor \bar{B}^{n\,dt}) = 1 - f(t) = 1 - e^{-\alpha t} \tag{10}$$

Fig. 22. A. Exponential decrease $f(t) = e^{-\alpha t}$, according to (9), with $\alpha = 2.5$.
B. Exponential increase $\bar{f}(t) = 1 - e^{-\alpha t}$, according to (10), with $\alpha = 2.5$.

This equation determines an exponential increase for the probability of a
change to $\bar{B}$, as drawn in figure 22 B. When we apply the Taylor series to (10),
we obtain for small t, according to (6),

$$\bar{f}(t) \approx \alpha t \tag{11}$$

This relation corresponds to the tangent shown as a dotted line in figure 22 B;
for large t, however, the function $\bar{f}(t)$ increases more slowly.
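
The limit leading to the exponential law, and the linear approximation for small times, can be checked with a few lines of arithmetic. This is a sketch only: the function names are introduced here, and the value 2.5 for the coefficient of transfer is the one quoted for figure 22.

```python
import math


def stay_probability_discrete(alpha, t, n):
    # product of n factors (1 - alpha*dt) with n*dt = t, as in (8b)
    dt = t / n
    return (1.0 - alpha * dt) ** n


def stay_probability(alpha, t):
    # limiting form (9)
    return math.exp(-alpha * t)


def change_probability(alpha, t):
    # formula (10)
    return 1.0 - math.exp(-alpha * t)


alpha = 2.5
# Finer and finer subdivisions of the interval t = 1 approach exp(-alpha * t).
approximations = [stay_probability_discrete(alpha, 1.0, n) for n in (10, 100, 1000, 10**6)]
limit_value = stay_probability(alpha, 1.0)

# For small t the change probability is close to alpha * t, as stated in (11).
small_t = 0.01
linear_estimate = alpha * small_t
exact_value = change_probability(alpha, small_t)
```

The linear estimate always lies above the exact value, which reflects the remark that the function increases more slowly than its tangent.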

These relations have many illustrations. For instance, the continuous
probability sequence may be given by the path of a gas molecule (its world
line); $B$ may be a certain region inside the vessel in which the molecule
moves. If the molecule happens to be in the region $B$ at the time t = 0, then
the probability that during the time t (that is, at any time point within the
interval t) it changes over to $\bar{B}$ at least once is given by (10). Or the continuous
probability sequence may be the life line of a human being; $\bar{B}$ may signify
that the person is ill. Then (10) represents the probability that the individual,
if in good health at the time t = 0, falls ill during the time t. According to (11),
this probability increases proportionally to time only for small t, otherwise
more slowly. Formula (9) states the probability that the person remains
healthy during the time t.






We calculate now the average length of time during which the attribute
point remains constantly in $B$. Such a period is characterized by the condition
that it starts with a change from $\bar{B}$ to $B$ and stops with a change from $B$ to $\bar{B}$.
[Notice that we did not demand for the definition of f(t) according to (8a)
that the stay in $B$ should cease; thus in f(t) cases are counted where the
attribute point remains in $B$ during a period longer than t.] We inquire after
the probability $\varphi(t)\,dt$ of a period of the kind described, the length of which
is between t and t + dt, counted among all periods of an uninterrupted stay
in $B$. The number of the latter periods being equal to the number of changes
from $\bar{B}$ to $B$, we can represent this probability, before going to the limit, by

$$\varphi^*(t)\,dt = P(A.\bar{B}.B^{dt},\,B^{2dt} \ldots B^{(n+1)dt}.\bar{B}^{(n+2)dt}) \tag{12a}$$

Because of the property (9, § 33) of the probability transfer this expression
is equal to

$$\begin{aligned}
\varphi^*(t)\,dt &= P(A.B^{dt},\,B^{2dt} \ldots B^{(n+1)dt}.\bar{B}^{(n+2)dt}) \\
&= P(A.B,\,B^{dt} \ldots B^{n\,dt}) \cdot P(A.B,\bar{B}^{dt}) \\
&= (1 - \alpha\,dt)^n \cdot \alpha\,dt
\end{aligned} \tag{12b}$$

When we make the transition to the limit dt = 0 for the first expression only,
employing (9), we have

$$\varphi(t)\,dt = e^{-\alpha t} \cdot \alpha\,dt \tag{13}$$

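For the desired average we thus find, according to (3, § 43) with $x = \alpha t$,

$$M(t) = \int_0^\infty t\,\varphi(t)\,dt = \int_0^\infty t \cdot e^{-\alpha t} \cdot \alpha\,dt = \frac{1}{\alpha}\int_0^\infty x\,e^{-x}\,dx = \frac{1}{\alpha}\,\Gamma(2) = \frac{1}{\alpha} \qquad (14)$$

$\Gamma$ is Euler’s gamma function, which satisfies for positive integers $z$ the relation

$$\Gamma(z) = \int_0^\infty x^{z-1} \cdot e^{-x}\,dx = (z - 1)! \qquad (15)$$

The average length of an uninterrupted stay in $B$ is therefore given by $\frac{1}{\alpha}$.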

Similarly, the dispersion connected with this average can be calculated.
From (7, § 43) we have

$$\Delta^2(t) = M(t^2) - M^2(t)$$

With (15) we find

$$M(t^2) = \int_0^\infty t^2 e^{-\alpha t}\,\alpha\,dt = \frac{1}{\alpha^2}\int_0^\infty x^2 e^{-x}\,dx = \frac{1}{\alpha^2}\,\Gamma(3) = \frac{2}{\alpha^2}$$

We thus derive

$$\Delta^2(t) = \frac{2}{\alpha^2} - \frac{1}{\alpha^2} = \frac{1}{\alpha^2} = M^2(t) \qquad (16)$$

$$\Delta(t) = \frac{1}{\alpha} = M(t)$$

We meet here with the special case that the linear dispersion is equal to the
average, that is, the dispersion amounts to 100%.
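
The equality of average and linear dispersion can be verified without the gamma function by integrating the density of (13) numerically. This is only a sketch: the midpoint rule, the cut-off of the integration range, and the value 2.5 for the transfer coefficient are choices made here for illustration.

```python
import math


def moments_of_stay(alpha, steps=200_000):
    """Midpoint-rule estimates of M(t) and M(t^2) for the density alpha * exp(-alpha * t)."""
    upper = 50.0 / alpha          # the tail beyond this point is of order e^-50
    h = upper / steps
    m1 = m2 = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        weight = alpha * math.exp(-alpha * t) * h   # phi(t) dt according to (13)
        m1 += t * weight
        m2 += t * t * weight
    return m1, m2


alpha = 2.5
mean, second_moment = moments_of_stay(alpha)
dispersion = math.sqrt(second_moment - mean ** 2)   # linear dispersion
print(mean, dispersion, 1.0 / alpha)
```

All three printed numbers agree to many decimal places, which is the content of (14) and (16).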

The meaning of the quantity $\alpha$ can be made clear as follows. When we
introduce a division by intervals $dt$ that are so small that each $dt$ at most
contains one change, the number $\nu_T$ of changes from $B$ to $\bar{B}$ during the time
$T$ is given by

$$\nu_T = N_{t=0}^{T}\,(u^t \in B).(u^{t+dt} \in \bar{B})$$

if $t$ assumes the discrete values $0, dt, 2\,dt, \dots T$. The value $\nu_T$ is independent
of the length of $dt$. Now we have

$$\alpha\,dt = P(A.B,\ \bar{B}^{dt}) = \lim_{T \to \infty} \frac{N_{t=0}^{T}\,(u^t \in B).(u^{t+dt} \in \bar{B})}{N_{t=0}^{T}\,(u^t \in B)} = \lim_{T \to \infty} \frac{\nu_T}{\frac{1}{dt}\,L_T(u^t \in B)} = dt \cdot \lim_{T \to \infty} \frac{\nu_T}{L_T(u^t \in B)} \qquad (17a)$$

where $L_T(u^t \in B)$ is the total length of time spent in $B$ up to the time $T$.
If we drop $dt$ on both sides and reverse the fractions, we have

$$\frac{1}{\alpha} = \lim_{T \to \infty} \frac{L_T(u^t \in B)}{\nu_T} \qquad (17b)$$

On the right side we have the quotient of the total length of all distances
in $B$ divided by the number of these distances, that is, the mean value of
the distances; its limit corresponds to the average. This is the statistical
meaning of (14).

Let us introduce the notation

$$\nu = \lim_{T \to \infty} \frac{\nu_T}{T} \qquad (18)$$

$\nu$ means the average number of changes from $B$ to $\bar{B}$ during the unit of time,
or the average number of one-sided changes during the unit of time; we call
$\nu$ the alternation frequency. The corresponding number of two-sided changes,
or the total alternation frequency, is twice as large. Multiplying numerator and
denominator of the expression on the right side of (17b) by $\frac{1}{T}$ and reversing
the fractions, we obtain

$$\alpha = \lim_{T \to \infty} \frac{\frac{1}{T}\,\nu_T}{\frac{1}{T}\,L_T(u^t \in B)} = \frac{\nu}{p}$$

$$\alpha p = \nu \qquad (19)$$

This is a very simple relation holding between the alternation frequency $\nu$,
the mean probability $p$, and the transfer coefficient $\alpha$; the alternation frequency
is, for given $p$, determined by $\alpha$. Therefore we call $\alpha$ the relative
measure of alternation; it measures the alternation with respect to $p$. For
statistical calculations (19) is used, conversely, to determine $\alpha$, since $p$ can be
found empirically by a simple measurement of time intervals and $\nu$ by enumeration.
This method represents a very simple statistical determination of the
transfer coefficient $\alpha$ and thereby of the exponent in the exponential laws
(9) and (10).
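
The statistical determination of the transfer coefficient from (19) can be illustrated by a small simulation. The sketch below rests on assumptions not stated in this form in the text: the stays in B and in non-B are drawn as exponentially distributed lengths (as (13) suggests for B), and the two transfer coefficients 2.0 and 3.0, the run length, and the seed are hypothetical values.

```python
import random


def simulate_two_state(alpha, beta, total_time, seed=0):
    """Simulate alternating stays in B and non-B with exponential lengths.

    Returns (time spent in B, number of changes from B to non-B, elapsed time).
    """
    rng = random.Random(seed)
    elapsed = 0.0
    time_in_b = 0.0
    changes = 0
    while elapsed < total_time:
        stay_b = rng.expovariate(alpha)      # uninterrupted stay in B, mean 1/alpha
        stay_not_b = rng.expovariate(beta)   # uninterrupted stay in non-B, mean 1/beta
        time_in_b += stay_b
        changes += 1                         # each stay in B ends with one change to non-B
        elapsed += stay_b + stay_not_b
    return time_in_b, changes, elapsed


time_in_b, changes, elapsed = simulate_two_state(alpha=2.0, beta=3.0, total_time=20_000.0)
p = time_in_b / elapsed      # mean probability of B, measured as a fraction of time
nu = changes / elapsed       # alternation frequency: one-sided changes per unit of time
alpha_estimate = nu / p      # relation (19) solved for the transfer coefficient
print(p, nu, alpha_estimate)
```

Only a measurement of time intervals (for p) and an enumeration of changes (for the alternation frequency) enter the estimate, which should come out close to the value 2.0 used to generate the data.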

The corresponding statement is valid for $\beta$, which likewise may be called
the relative measure of alternation, since the relation

$$\beta(1 - p) = \nu \qquad (20)$$

can be derived in analogy to (19). The quantity $\beta$ thus measures the alternation
with respect to $1 - p$. The identity of (19) and (20) is warranted by
(7). When we denote by $\bar{t}$ the time of the stay in $\bar{B}$, we have, correspondingly,

$$M(\bar{t}) = \frac{1}{\beta} \qquad (21)$$

In the example of the life line of a person, for which $\bar{B}$ denotes illness, $\frac{1}{\beta}$ represents
the average duration of an attack of illness; $\nu$ is the number of attacks
of illness that the person suffers during the unit of time. The total duration
of all attacks of illness is measured, in comparison to the length of life, by
$1 - p$.

The insurance company that pays financial subsidies during the period of
illness takes $1 - p$ into account as the quantity determining the insurance
premium; but with respect to the physical constitution of the person the
quantity $\beta$ is also a characteristic, since $\beta$ measures the probability that the
person falls ill. If the coefficient $\beta$ is large for a given value of $p$, the frequency
of attacks of illness is high and the average duration of the attack is short.

The example concerning the theory of gases supplies an illustrative interpretation
of $\nu$ if we understand by $\bar{B}$ the collision of one molecule with another.
Then $\nu$ represents the number of collisions in the time unit. If we assume the
duration of a collision $\bar{B}$ to be very short, compared to the time period during
which no collisions occur, we derive from (3) that $p$ is nearly equal to 1.
Therefore $\beta$ must be very large, and according to (19) we have approximately
$\alpha = \nu$. This relation, well known in the theory of gases, likewise
provides a good illustration of the measure of alternation $\alpha$. According to
(14), the quantity $\frac{1}{\alpha}$ represents the mean duration of the free path, that is,
a path without collisions. This result agrees with the relation $\alpha = \nu$ if the
duration of individual collisions can be neglected. When we multiply $\frac{1}{\alpha}$ by
the mean velocity, we obtain the length of the mean free path (or, more
correctly, one of the possible definitions of the mean free path).

The formulas developed can be given a further interpretation when we 
consider a lattice of continuous probability sequences. The number of se¬ 
quences itself may be discrete, that is, characterized by a discrete superscript k. 
If we make the further assumption that the lattice is lattice-invariant, the 
phase probabilities counted horizontally are realized also in the vertical 
direction. These conditions correspond to the case of a lattice in which the 
horizontal sequences represent the world lines of gas molecules. For the 
analysis of such a lattice it is advisable to use, first, enumeration by sections 
in the horizontal direction, since we may assume that the sequences are 
regular-invariant. 

If sections of the length $t$ are cut off, the relative frequency of sections
lying completely within $B$ is given by $e^{-\alpha t}$, according to (9); they will be
called “sections of the first kind”. The relative frequency of the other sections,
or “sections of the second kind”, is given by $1 - e^{-\alpha t}$. On the other
hand, we can count the number of changes by counting for each section the
number $\nu_i$ of one-sided changes; if $n$ is the number of sections, their sum is
given by

$$\sum_{i=1}^{n} \nu_i = \nu_T, \qquad T = nt \qquad (22)$$

Therefore their average is represented by

$$M(\nu_i)_i = \lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} \nu_i = \lim_{n \to \infty} \frac{1}{n}\,\nu_T = t \cdot \lim_{T \to \infty} \frac{\nu_T}{T} = t\nu \qquad (23)$$
This result is in agreement with the definition of $\nu$, since $\nu$ represents the
number of changes for $t = 1$. For $n$ sections the number of one-sided changes
will be given approximately by $nt\nu$. This differs from the number $n(1 - e^{-\alpha t})$
of the sections of the second kind, because the latter contain partly no change,
partly more than one change.

Second, we introduce lattice enumeration, adding the assumption [as for
(9, § 34)] that the first element $t = 0$ of all horizontal sequences belongs to $B$.
(The lattice is then not homogeneous.) Because of the lattice invariance, the
frequency, counted vertically, of sections of the first kind of the length $t$, all
of which begin with $t = 0$, is given by $e^{-\alpha t}$; the frequency of sections of the
second kind is given by $1 - e^{-\alpha t}$. The summation of the changes contained
in all the sections does not admit immediately of the interpretation (22),
since these sections do not combine into one continuous sequence. But (23)
is valid nonetheless, since the summation, because of the lattice invariance,
gives the same result as the enumeration by sections; thus the number $\nu$ can
be determined in the vertical direction also. In the relation (19) we must
substitute for $p$ the value of the horizontal sequences.

We can now consider a case that is important in many applications, resulting
from a degeneration of the lattice. Let the horizontal sequences
represent the world lines of radioactive atoms, $\bar{B}$ the disintegration of the
atom; then the horizontal sequences are degenerated, because they will no
longer return to $B$ after the event $\bar{B}$ has once occurred. The definition (3)
therefore supplies $P(A, B^{kt})^{t} = p = 0$. Correspondingly, we infer from (7)
that also $\beta = 0$; and (19) gives $\nu = 0$. The lattice is therefore not lattice-invariant;
but this very circumstance makes it possible that some of the
relations derived previously remain valid in the vertical direction. Thus the
vertical sequences realize the probability $f(t)$, because the infinitesimal relations
(6) remain valid in the vertical direction and determine $\alpha$, and because
the factorization of (8a), stated in (8b), holds for vertical enumeration.
Equations (9) and (10) then represent the familiar law of radioactive disintegration.
If there are $N$ atoms present, then during the time $t$ approximately

$$N \cdot (1 - e^{-\alpha t}) \qquad (24)$$

atoms will disintegrate, according to (10). At the end of this time only

$$N \cdot e^{-\alpha t} \qquad (25)$$

atoms will be left over. The quantity $\alpha$ is called the constant of disintegration;
it can be determined only by enumeration in the vertical direction. The relation
(14) remains valid, too, for enumeration in the vertical direction. The
average number of changes for a section of the length $t$ in the vertical direction
is identical with the number of sections of the second kind, since in the horizontal
direction there can never occur more than one change; and, because of
the assumption concerning the first element of the sequences, each section of
the second kind must contain at least one change. This number, therefore,
is $1 - e^{-\alpha t}$. The number, however, is not identical with the number $\nu t$, as
defined earlier, since the definition of $\nu$ in the horizontal direction leads to
$\nu = 0$ and the relation (19) is fulfilled only for this value.

In another example the horizontal sequences may represent the life lines
of persons and $\bar{B}$ the death resulting from a traffic accident. Since this example
displays the same degeneration as the previous one, $\alpha$ is determined, once
more, only by enumeration in the vertical direction. If we put $dt = 1$ year,
we have $\alpha \approx$ [numerical value illegible] according to p. 183 (footnote). In this
case we may apply the approximation (11): the probability of such an accident
in the course of two years is twice as great as for one year. This proportionality,
however, holds for short periods only; for long periods the strict
law (10) is to be employed. The usual treatment of social statistics by the
use of the law of proportionality (11), therefore, provides merely an approximation.
The quotients obtained from social statistics must be interpreted,
strictly speaking, as transfer coefficients of continuous probability sequences.
The necessity of this interpretation is even more clearly visible for nondegenerated
cases, such as are supplied, for instance, by the number of attacks
of illness.
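
The two degenerate examples can be put into numbers with a short sketch. The number of atoms and the values of the transfer coefficient used here are hypothetical and serve only to illustrate formulas (24) and (25) and the limits of the proportionality (11).

```python
import math


def disintegrated(n_atoms, alpha, t):
    """Expected number of atoms that have disintegrated after time t, formula (24)."""
    return n_atoms * (1.0 - math.exp(-alpha * t))


def remaining(n_atoms, alpha, t):
    """Expected number of atoms still present after time t, formula (25)."""
    return n_atoms * math.exp(-alpha * t)


# Hypothetical values for illustration; they are not taken from the text.
N, ALPHA = 1_000_000, 0.2
for t in (1, 5, 10):
    print(t, round(disintegrated(N, ALPHA, t)), round(remaining(N, ALPHA, t)))


def doubling_ratio(alpha):
    """Probability of the event within two periods divided by that within one, by law (10)."""
    return (1.0 - math.exp(-2 * alpha)) / (1.0 - math.exp(-alpha))


print(doubling_ratio(1e-4))  # close to 2: the proportionality (11) is a good approximation
print(doubling_ratio(1.0))   # clearly below 2: the strict law (10) must be used
```

The ratio equals 2 only in the limit of a vanishing transfer coefficient, which is why the doubling rule of social statistics is an approximation for short periods.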

The examples show that continuous probability sequences are of great 
practical importance. They supply an instrument permitting us to interpret 
the continuous lines of causal connection, or causal chains, as probability
sequences. Through this interpretation the continuous probability sequences, 
besides their practical applicability, acquire a fundamental importance. 

The usual conception of the causal chain as a strictly determined connec¬ 
tion between events, the course of which can be predicted with certainty 
from given initial conditions for any subsequent length of time, has turned out 
to represent an untenable idealization, which cannot do justice to the facts 
gathered in natural sciences. Since all the laws of nature possess the form of 
probability implications, the causal chain must be represented by a continuous 
sequence in which the later elements are determined by the preceding ones 
with a certain probability only. The particular form of the determination 
will be of the type studied here: at every point event the ensuing occurrences 
are determined with certainty only for an infinitesimal period of time, whereas 
for a finite length of time they are determined only with a probability that 
decreases continuously with increasing time. This statement holds for classical 
as well as for quantum physics, since, even if Heisenberg’s uncertainty rela¬ 
tions are not taken into account, the concept of causal chain must be described 
by a process of convergence exhibiting the following form: when we wish to 
predict at the time $t_0$ the state of a phenomenon for the later time $t$ with the
probability $1 - \delta$ ($\delta$ small), we can specify at the time $t_0$ an observational
exactness $G$ that achieves this result; but for the same exactness $G$ a time
$t_1 > t$ can be specified for which the probability of the prediction is smaller
than a given small value $\epsilon$. Only this double assertion can characterize exhaustively
the process of convergence, and we recognize that the idea of
determinism seems scarcely justifiable even in classical physics.

The result holds to a much higher degree in Heisenberg’s quantum mechanics,
in which it is impossible, in general, to increase the probability of prediction
as close to certainty as we wish, as soon as finite time intervals are
considered, however small they may be. In quantum physics, therefore, not
only every actually formulated causal chain is represented by a continuous
probability sequence, but this character is also to be ascribed to the limit
that the formulated causal chains approach when the exactness of the observation
is made as great as possible.⁴ The zigzag path of the Brownian
motion must therefore be regarded as the prototype of every causal chain.


§ 47. Competition of Chances 

The developed methods will now be used for the treatment of some problems 
of practical importance. 

Assume that a man is fishing at a pond. The sequence of events is a continuous
sequence of time points; if the man catches a fish, the time point
has the attribute $B$, otherwise $\bar{B}$. We may idealize the problem by assuming
that the catching of the fish is an event of so short a duration that it fills
only one time point. The probability of getting a fish increases continuously
with time; considered for any moment, however, it has the same value, and
therefore we may write, in correspondence with (6, § 46),

$$P(A.\bar{B},\ B^{dt}) = \beta\,dt \qquad (1)$$

By considerations like those applied in (8-10, § 46) we derive the formula

$$P(A,\ B_{\Delta t}) = 1 - e^{-\beta \Delta t} \qquad (2)$$

for the probability of catching a fish during the time $\Delta t$.

Assume that there is a second pond at which another man fishes. We shall
distinguish the two men by the subscripts 1 and 2; and, assuming that both
men fish during the same period $\Delta t$, we shall omit the subscript $\Delta t$ of $B$.
Thus we write

$$P(A_1, B_1) = 1 - e^{-\beta \Delta t} = p$$
$$P(A_2, B_2) = 1 - e^{-\gamma \Delta t} = q \qquad (3)$$

The constants $\beta$ and $\gamma$ measure the fishing chances of each man.

Now introduce the assumption that each pond contains only one fish.
Formulas (1)-(3) apply to this case as well. For the statistical interpretation
of the probabilities we then assume that the fishing experiment is made
during many periods $\Delta t$. If the fish is caught in one period, no new fish is
added during that period; for the following period, however, a new fish is
put into the pond. The probabilities $q$ and $p$ then represent frequencies
counted in terms of the periods $\Delta t$.

⁴ Further explanations by the author are found in Erkenntnis, Vol. I (1930), p. 158; Vol.
II (1931), p. 156; Vol. III (1932), p. 32; Naturwissenschaften, Vol. 19 (1931), p. 713; and
Philosophic Foundations of Quantum Mechanics (Berkeley, 1944), § 1.

What is the probability that at least one man catches a fish during $\Delta t$?
Since we can assume independence, we have for this probability the value

$$P(A_1.A_2,\ B_1 \lor B_2) = p + q - pq \qquad (4)$$

We now turn to a new problem. Assume that there is only one pond with 
one fish, and that both men fish at the pond. So long as neither man has 
caught the fish, the chances of catching it are the same as before for each man. 
But if one man catches the fish, the other can no longer get it; the chances 
are diminished for each man by the competition of the other. Thus there is a
competition of chances. 

The probabilities (3) must now be denoted in a different way. The probability
$p$ exists for the first man only when the second does not fish, and
vice versa for the probability $q$. We therefore must write

$$P(A_1.\bar{A}_2,\ B_1) = 1 - e^{-\beta \Delta t} = p$$
$$P(A_2.\bar{A}_1,\ B_2) = 1 - e^{-\gamma \Delta t} = q \qquad (5)$$

What is now the probability that the fish is caught during the time $\Delta t$?
We can easily see that the answer (4) remains correct. To show this, let us
return to the arrangement with two ponds and assume that, if both men
catch the fish during the same period, only that man is given credit for
catching who catches the fish first, and that he also gets credit if the other
does not catch his fish. The probability that credit is given is then obviously
represented by (4), since in the frequency counting of the cases $B_1 \lor B_2$ the
case where both $B_1$ and $B_2$ are true is counted only once anyway. Now when
both men fish at the same pond, the catching of the fish by one man corre¬ 
sponds to the receiving of credit in the other arrangement. As the credit 
received by one man deprives the other of the chance of getting it, so does 
the fish caught by one man deprive the other of the chance of getting it. 
Therefore (4) also represents the probability of the fish being caught if only 
one pond is used. 

It is more difficult to compute the probability $P(A_1.A_2, B_1)$, that is, the
probability that the first man catches the fish if the other man fishes simultaneously
at the same pond. Since the events $B_1$ and $B_2$ are now exclusive,
we have

$$P(A_1.A_2,\ B_1) + P(A_1.A_2,\ B_2) = P(A_1.A_2,\ B_1 \lor B_2) \qquad (6)$$

In combination with (4) this relation indicates that

$$P(A_1.A_2,\ B_1) < p$$
$$P(A_1.A_2,\ B_2) < q \qquad (7)$$

We therefore have a reduction of the individual probabilities through the
competition of chances.

The value of $P(A_1.A_2, B_1)$ can be found when we return to the arrangement
with two ponds and ask for the probability that the first man catches
the fish before the other man catches his. The probability can be constructed
as follows. Because of the transfer character of the probability sequence and
the independence of the two fishing processes, we have, in a generalization
of (1),

$$P(A_1.A_2.\bar{B}_1^{dt} \dots \bar{B}_1^{(n-1)dt}.\bar{B}_2^{dt} \dots \bar{B}_2^{(n-1)dt},\ B_1^{n\,dt}) = \beta\,dt$$
$$P(A_1.A_2.\bar{B}_1^{dt} \dots \bar{B}_1^{(n-1)dt}.\bar{B}_2^{dt} \dots \bar{B}_2^{(n-1)dt},\ B_2^{n\,dt}) = \gamma\,dt \qquad (8)$$

The probability that both men catch their fish in the same interval $dt$ then
is $\beta\gamma\,dt^2$ and is thus of the second order; it may therefore be neglected.
Now the case that the first man catches the fish in one of the intervals
$dt_1, dt_2, \dots dt_n$ may be written

$$(B_1^{dt} \lor B_1^{2dt} \lor \dots \lor B_1^{n\,dt}) \equiv (B_1^{dt} \lor \bar{B}_1^{dt}.\bar{B}_2^{dt}.B_1^{2dt} \lor \dots \lor \bar{B}_1^{dt} \dots \bar{B}_1^{(n-1)dt}.\bar{B}_2^{dt} \dots \bar{B}_2^{(n-1)dt}.B_1^{n\,dt}) \qquad (9)$$

Since the terms on the right side of (9) are exclusive, the corresponding
probabilities may be added, and we have, for the probability $r$ that the first
man catches the fish in one of the intervals $dt_1, dt_2, \dots dt_n$,

$$\begin{aligned} r &= P(A_1.A_2,\ B_1^{dt} \lor B_1^{2dt} \lor \dots \lor B_1^{n\,dt}) \\ &= P(A_1.A_2,\ B_1^{dt}) + P(A_1.A_2,\ \bar{B}_1^{dt}.\bar{B}_2^{dt}.B_1^{2dt}) \\ &\quad + \dots + P(A_1.A_2,\ \bar{B}_1^{dt} \dots \bar{B}_1^{(n-1)dt}.\bar{B}_2^{dt} \dots \bar{B}_2^{(n-1)dt}.B_1^{n\,dt}) \end{aligned} \qquad (10)$$

For the combination terms on the right side we may employ the special
theorem of multiplication, since we assume that the two fishing processes are
independent. Using (8), we thus have

$$\begin{aligned} r &= \beta\,dt + (1 - \beta\,dt)(1 - \gamma\,dt) \cdot \beta\,dt + \dots + (1 - \beta\,dt)^{n-1}(1 - \gamma\,dt)^{n-1} \cdot \beta\,dt \\ &= \beta\,dt \cdot \frac{(1 - \beta\,dt)^n (1 - \gamma\,dt)^n - 1}{(1 - \beta\,dt)(1 - \gamma\,dt) - 1} \\ &= \frac{\beta}{\beta + \gamma - \beta\gamma\,dt}\,[1 - (1 - \beta\,dt)^n (1 - \gamma\,dt)^n] \end{aligned} \qquad (11)$$

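The term $\beta\gamma\,dt$ in the denominator can be neglected, since it goes to zero with
$dt \to 0$. Putting $dt = \frac{\Delta t}{n}$ we have

$$(1 - \beta\,dt)^n (1 - \gamma\,dt)^n = \left(1 - \frac{\beta \Delta t}{n}\right)^n \left(1 - \frac{\gamma \Delta t}{n}\right)^n \qquad (12)$$

and using the definition of the number $e$, we have

$$\lim_{n \to \infty} \left(1 - \frac{\beta \Delta t}{n}\right)^n \left(1 - \frac{\gamma \Delta t}{n}\right)^n = e^{-\beta \Delta t} \cdot e^{-\gamma \Delta t} = e^{-(\beta + \gamma)\Delta t} \qquad (13)$$

Therefore (11) assumes the form

$$r = \frac{\beta}{\beta + \gamma}\,[1 - e^{-(\beta + \gamma)\Delta t}] \qquad (14)$$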
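
The passage from the finite sum (11) to the limit (14) can be followed numerically. The sketch below sums the terms of (11) for a growing number of subintervals and compares the result with (14); the two fishing chances 0.5 and 0.3 and the period length 2.0 are hypothetical values chosen for illustration.

```python
import math


def r_discrete(beta, gamma, delta_t, n):
    """Sum (11): the first man catches the fish first, with n subintervals of length dt."""
    dt = delta_t / n
    survive = (1.0 - beta * dt) * (1.0 - gamma * dt)  # neither man succeeds within one dt
    total, still_free = 0.0, 1.0
    for _ in range(n):
        total += still_free * beta * dt   # fish still free so far, first man succeeds now
        still_free *= survive
    return total


def r_limit(beta, gamma, delta_t):
    """Limit (14) of the sum for dt -> 0."""
    return beta / (beta + gamma) * (1.0 - math.exp(-(beta + gamma) * delta_t))


beta, gamma, delta_t = 0.5, 0.3, 2.0  # hypothetical fishing chances and period length
for n in (10, 100, 1000, 10000):
    print(n, r_discrete(beta, gamma, delta_t, n), r_limit(beta, gamma, delta_t))
```

The discrete values approach the limit from above as the subdivision is refined, the difference shrinking roughly in proportion to the length of the subinterval.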

Now the probability $P(A_1.A_2, B_1)$, referring to the arrangement with only
one pond, is obviously identical with $r$; in fact, the relation (10) applies likewise
to this arrangement. The consideration of the two-pond arrangement
acts only as a help in illustrating the existing relations. The independence of
the two fishing processes, even when performed at the same pond, is expressed
in (8), relations which say that so long as the fish is not caught the chances
for each man are not changed by the presence of the other man; only when
the fish is caught does one process interfere with the other. We therefore have

$$P(A_1.A_2,\ B_1) = r = \frac{\beta}{\beta + \gamma}\,[1 - e^{-(\beta + \gamma)\Delta t}] \qquad (15)$$

By similar considerations we derive

$$P(A_1.A_2,\ B_2) = s = \frac{\gamma}{\beta + \gamma}\,[1 - e^{-(\beta + \gamma)\Delta t}] \qquad (16)$$
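
Formulas (15) and (16) may also be compared with a direct simulation of the one-pond arrangement. The sketch assumes that each man's waiting time for the fish, taken by itself, is exponentially distributed with his own fishing chance as rate, which is one way of modelling the exponential law (2); the numerical values, the number of trials, and the seed are hypothetical.

```python
import math
import random


def simulate_competition(beta, gamma, delta_t, trials=100_000, seed=1):
    """Estimate r and s by drawing independent exponential waiting times for both men."""
    rng = random.Random(seed)
    first_wins = second_wins = 0
    for _ in range(trials):
        t1 = rng.expovariate(beta)    # time at which the first man would catch the fish
        t2 = rng.expovariate(gamma)   # time at which the second man would catch it
        if min(t1, t2) > delta_t:
            continue                  # nobody catches the fish within the period
        if t1 < t2:
            first_wins += 1           # the first man is earlier; the second gets nothing
        else:
            second_wins += 1
    return first_wins / trials, second_wins / trials


beta, gamma, delta_t = 0.5, 0.3, 2.0  # hypothetical values
r_sim, s_sim = simulate_competition(beta, gamma, delta_t)
caught = 1.0 - math.exp(-(beta + gamma) * delta_t)   # bracket of (15) and (16)
print(r_sim, beta / (beta + gamma) * caught)
print(s_sim, gamma / (beta + gamma) * caught)
```

Within the sampling error the simulated frequencies agree with the closed expressions, and their sum estimates the probability that the fish is caught at all.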


Now it is easily verified that, with the meaning of $p$ and $q$ defined in (5),
the brackets in (15) and (16) have the form $p + q - pq$, so that we finally
arrive at the result

$$P(A_1.A_2,\ B_1) = r = \frac{\beta}{\beta + \gamma}\,(p + q - pq)$$
$$P(A_1.A_2,\ B_2) = s = \frac{\gamma}{\beta + \gamma}\,(p + q - pq) \qquad (17)$$

We see that the relations (6) and (4) are satisfied. The probabilities $r$ and $s$
may be called reduced probabilities, in comparison to the probabilities $p$ and $q$;
they are reduced by the competition of chances.
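
A short computation shows how the reduced probabilities relate to the unreduced ones. The following sketch evaluates (5), (4), and (17) for hypothetical fishing chances and checks the relations (6) and (7); the function name and the numbers are illustrative only.

```python
import math


def competition(beta, gamma, delta_t):
    """Return p, q of (5) and the reduced probabilities r, s of (17)."""
    p = 1.0 - math.exp(-beta * delta_t)    # first man fishing alone
    q = 1.0 - math.exp(-gamma * delta_t)   # second man fishing alone
    either = p + q - p * q                 # (4): the fish is caught by someone
    r = beta / (beta + gamma) * either     # first man, with competition
    s = gamma / (beta + gamma) * either    # second man, with competition
    return p, q, r, s


p, q, r, s = competition(beta=0.5, gamma=0.3, delta_t=2.0)  # hypothetical values
print(f"p = {p:.4f}, q = {q:.4f}, r = {r:.4f}, s = {s:.4f}")
print(abs((r + s) - (p + q - p * q)) < 1e-12)  # relation (6) combined with (4)
print(r < p and s < q)                          # reduction stated in (7)
```

The disjunction keeps its value while each individual probability is lowered, the total being split between the two men in the ratio of their fishing chances.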

Since (5) supplies

$$\beta = -\frac{1}{\Delta t}\log_e(1 - p)$$
$$\gamma = -\frac{1}{\Delta t}\log_e(1 - q) \qquad (18)$$

we can write (17) also in the form

$$P(A_1.A_2,\ B_1) = r = \frac{\log_e(1 - p)}{\log_e(1 - p) + \log_e(1 - q)}\,(p + q - pq)$$
$$P(A_1.A_2,\ B_2) = s = \frac{\log_e(1 - q)}{\log_e(1 - p) + \log_e(1 - q)}\,(p + q - pq) \qquad (19)$$

Because of the logarithmic series

$$\log_e(1 - p) = -p - \frac{p^2}{2} - \frac{p^3}{3} - \dots \qquad (20)$$

we have for small values $p$ and $q$ the approximation

$$r \approx \frac{p}{p + q}\,(p + q - pq)$$
$$s \approx \frac{q}{p + q}\,(p + q - pq) \qquad (21)$$

For larger values $p$ and $q$, however, the approximation (21) cannot be used.
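
How quickly the approximation (21) departs from the exact form (19) can be seen from a few sample values. The sketch below is illustrative; the pairs of probabilities are chosen arbitrarily.

```python
import math


def r_exact(p, q):
    """Formula (19) for the reduced probability of the first attribute."""
    log_p, log_q = math.log(1.0 - p), math.log(1.0 - q)
    return log_p / (log_p + log_q) * (p + q - p * q)


def r_approx(p, q):
    """Approximation (21), intended for small p and q."""
    return p / (p + q) * (p + q - p * q)


for p, q in ((0.01, 0.02), (0.1, 0.2), (0.5, 0.9), (0.9, 0.5)):
    print(p, q, round(r_exact(p, q), 4), round(r_approx(p, q), 4))
```

For probabilities of a few percent the two values are practically identical, whereas for values such as 0.5 and 0.9 the approximation is off by more than a tenth.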

The rules of the competition of chances find many applications. When 
two antiaircraft batteries fire at one plane, formulas (4), (5), and (17) apply. 
Or if a man suffers from two diseases, the probability of his death is given by 
(4), whereas the probability of his death by either disease is reduced by the 
competition of the other disease. In all such problems, the probability of the 
disjunction is not affected by the competition, whereas the probability of 
the individual attribute is reduced thereby. 

These results may now be applied to the discussion of a numerical problem, 
which also offers some interest from another angle in that it illustrates the 
fact that the juristic inferences used in cases of circumstantial evidence can 
be analyzed in terms of the calculus of probability. The reader may try to 
solve the following problem for himself before he studies the solution given 
below. 

When Dr. Bergmann returned to the Tyrolese village, he immediately 
reported to the gendarmerie that his wife had fallen down the 3,000-foot 
precipice. Her foot had slipped, he said, when they were crossing the narrow 
crest. A crew of experienced guides was dispatched, accompanied by Dr. 
Bergmann. After seven hours they returned, carrying the dead woman on a 
stretcher. 

One of the guides expressed astonishment at the fact that Dr. Bergmann 
had not roped his wife while traversing the crest. He said that guides always 
roped tourists at that crossing; he estimated that one out of ten persons would 
slip on the icy stones at that place and was then held only by the rope.
Dr. Bergmann, who was an experienced mountaineer, answered that the
crossing had not appeared hazardous to him, and the guides admitted that
the danger could not be foreseen by anyone not familiar with the location.

The gendarmerie chief reported the case as an unfortunate accident. The
situation changed, however, when sometime later it was learned that Dr.
Bergmann, six months before the accident, had taken out a life-insurance
policy for $20,000, in the name of his wife. An inquiry at the office of the
insurance company brought out the fact that, when a man takes out a large 
amount of life insurance for his wife in a childless marriage, he does so, in 
four out of five cases, with criminal intentions. Dr. Bergmann and his wife 
had no children. 

When questioned, Dr. Bergmann ridiculed the suspicion of murder and 
advanced the objection that within the last few months he had made another 
equally hazardous mountain trip with his wife. Furthermore, as a doctor he 
would have had ample opportunity of taking her life. During a recent attack 
of gallstones he had been obliged to give her a morphine injection, and could 
easily have administered an overdose if he had intended to murder her. 

What is the probability that Dr. Bergmann killed his wife by pushing her 
down the crest or making her slip in some way? 

We denote by $A_1$ the situation of a person exposed to murder by a man in
the situation of Dr. Bergmann during the time $\Delta t$ of the crossing of the
mountain crest. By $A_2$ we denote the situation of a person exposed to fatal
accident on the mountain crest during the time $\Delta t$. $B_1$ may represent death
by murder; $B_2$, death by accident. Since we know that, in either case, the
disjunction $B_1 \lor B_2$ is true, the probability sought after has the form

$$P(A_1.A_2.[B_1 \lor B_2],\ B_1) \qquad (22)$$

Because of the analytic equivalence

$$([B_1 \lor B_2].B_1 \equiv B_1) \qquad (23)$$

we can write, using the general theorem of multiplication,

$$P(A_1.A_2.[B_1 \lor B_2],\ B_1) = \frac{P(A_1.A_2,\ B_1)}{P(A_1.A_2,\ B_1 \lor B_2)} \qquad (24)$$

The difficulty consists in the correct evaluation of the two probabilities
occurring on the right side of (24).

In looking for the numerical values, we must use the given data as best 
we can, realizing that we can expect to find only crude appraisals, since 
extensive statistics are not available. The value 0.8 given by the insurance 
company for the probability of murder is the best value we can use, although 
a psychological analysis of Dr. Bergmann may lead to the result that for him 
a much higher, or much lower, value would be appropriate. So long as such
an analysis is not known to us we must employ the known data. Insurance
murders are usually committed not very long after the insurance is taken out;
we shall assume, therefore, that the probability that Dr. Bergmann will murder
his wife during the six-month period is $w = 0.8$. Most of the time he will have
no opportunity to commit murder, since it is not easy to do so without leaving 
clues. Assume that the three occasions mentioned in Dr. Bergmann’s defense 
exhaust his opportunities for committing murder, and that they can be given 
equal weight. The probability of 0.8 may then be equally distributed over 
the three occasions. 

For this purpose, we regard the probability of murder as progressing with
time according to the form $1 - e^{-\beta t}$, the argument $t$ running, however, only
through the three occasions. Equal weights for the occasions may be represented
as equality of the time intervals $\Delta t$ covered by them. Putting $\Delta t = 1$
(this is only a convention for the time unit), we thus have, with $w = 0.8$,

$$w = 1 - e^{-3\beta} = 0.8 \qquad (25)$$

which, with the use of (18), gives

$$\beta = -\frac{1}{3}\log_e(1 - w) = 0.536 \qquad (26)$$

The probability of murder during any of the three occasions, on the condition
that no murder was committed on one of the previous occasions, is then
given by¹

$$p = 1 - e^{-\beta} = 0.36 \qquad (27)$$


The value (27) may be regarded as giving the probability $P(A_1, B_1)$; but since
the statistics of insurance companies were prevalently compiled from cases
in which the chance of death was not in competition with the chance of a
serious accident, the value (27) may also be interpreted as representing the
probability $P(A_1.\bar{A}_2, B_1)$, corresponding to the meaning of $p$ in (5). The two
values will not differ very much, since $P(A_1, A_2)$ may be assumed to be small;
the reason is that the rule of elimination (2, § 19) gives

$$\begin{aligned} P(A_1, B_1) &= P(A_1, A_2) \cdot P(A_1.A_2,\ B_1) + P(A_1, \bar{A}_2) \cdot P(A_1.\bar{A}_2,\ B_1) \\ &= P(A_1.\bar{A}_2,\ B_1) + P(A_1, A_2) \cdot [P(A_1.A_2,\ B_1) - P(A_1.\bar{A}_2,\ B_1)] \end{aligned} \qquad (28)$$

an expression in which the second term may be neglected for small values
$P(A_1, A_2)$. The probability $P(A_1.A_2, B_1)$, however, can then be very different
from $P(A_1, B_1)$.

¹ It might appear paradoxical that, if for each of the three periods we have the probability
0.36, the sum of the three values is > 1. But these probabilities are not additive. Let $A$
denote the situation of Dr. Bergmann after taking out life insurance for his wife. For the first
period we have $P(A, B_1^{\Delta t_1}) = 0.36$; for the second, $P(A.\bar{B}_1^{\Delta t_1}, B_1^{\Delta t_2}) = 0.36$; and so on. That
is, the value 0.36 is dependent on the condition that the murder has not occurred before.
Seen from the beginning of the total time period, the probability of a murder in the first
period is $P(A, B_1^{\Delta t_1})$ and that of a murder in the second period is $P(A, \bar{B}_1^{\Delta t_1}.B_1^{\Delta t_2})$, since the
murder can occur in the second period only if it did not occur in the first. The latter probability
is therefore smaller than the first. Seen from the beginning of the second period,
however, when we know that the murder has not occurred in the first, the probability of a
murder in the second period is $P(A.\bar{B}_1^{\Delta t_1}, B_1^{\Delta t_2})$, and this value is equal to $P(A, B_1^{\Delta t_1})$.

For the chance of accident the value 0.1 is given, representing the probability
$P(A_2.\bar{A}_1, B_2) = q$, since these statistics were certainly made with
persons not exposed to an insurance murder, which was excluded by the
presence of guides. Since the accident chance operates during the period
$\Delta t = 1$, we derive from (18)

$$\gamma = -\log_e(1 - q) = -\log_e(1 - 0.1) = 0.105 \qquad (29)$$

The two chances of death, that by murder and that by accident, are in competition
during the passage over the mountain crest, and we therefore have
here the conditions expressed in formulas (4) and (17). Applying these formulas
to (24), we obtain

$$P(A_1.A_2.[B_1 \lor B_2],\ B_1) = \frac{\beta}{\beta + \gamma} = \frac{0.536}{0.536 + 0.105} = 0.84 \qquad (30)$$


The probability of murder, on the basis of the known facts, is therefore 84%. 
What makes this value even higher than the antecedent probability of 80% 
is the fact that the insured person died, and not of a natural death. That the 
resulting inverse probability is only slightly higher is chiefly a consequence 
of Dr. Bergmann’s excellent defense, which distributes the high antecedent
probability of 80% over several periods. 
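
The arithmetic of the example can be retraced in a few lines. The sketch uses only the data of the story (antecedent probability 0.8, three occasions of equal weight, accident chance 0.1); the variable names are chosen here for illustration.

```python
import math

w = 0.8            # antecedent probability of murder over the three occasions
occasions = 3      # opportunities assumed to carry equal weight
q = 0.1            # chance of a fatal slip on the crest without a rope

beta = -math.log(1.0 - w) / occasions   # (26)
p = 1.0 - math.exp(-beta)               # (27): murder on one occasion, none before
gamma = -math.log(1.0 - q)              # (29)
posterior = beta / (beta + gamma)       # (30)

print(f"beta = {beta:.3f}, gamma = {gamma:.3f}, posterior = {posterior:.3f}")
print(f"p = {p:.3f}")
```

The coefficients come out as about 0.536 and 0.105, and the resulting probability as about 0.84, in agreement with (26), (29), and (30). Evaluated in this way, the conditional probability for a single occasion comes out near 0.42, somewhat above the value printed in (27); since (30) uses only the two coefficients, the final result of 84% is not affected.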

This analysis of the problem corresponds to the weights we would give 
instinctively to the different factors involved. It is superior to an instinctive 
appraisal in that it supplies the resultant of all the factors by mathematical 
computation. The precision of the numerical result, of course, should not be
overestimated. Based on rough statistical estimates and omitting any
psychological analysis of the man involved, it will represent no more than the
best judgment available. And it is scarcely necessary to say that the
characters and the numerical values of the story are fictitious.

§ 48. The Frequency Sequences

The questions to be considered in this section correspond somewhat to the 
questions on the structure of order, which was discussed in chapter 4. The 
problems of the theory of order were concerned with questions about the 
probabilities in subsequences derived in a certain manner from the major 
sequence or from a lattice of major sequences. It was in terms of these
probabilities that the properties of order of probability sequences were
expressed. A number of types of order were thus defined, by means of which the
total domain of probability sequences was subdivided with respect to their
structure of order.

In the present chapter the frequency properties of probability sequences 
will be characterized by similar methods, as, once more, certain sequences 
derived from the major probability sequence are considered. This
investigation, however, differs from the theory of order in employing, not
subsequences of the major sequence, but certain other sequences derived from
it, to be called coordinated frequency sequences. In defining this term one
must distinguish three kinds of frequency sequences. In order to simplify the
discussion, we shall assume the sequence A to be compact.

First, we can coordinate to each element $y_n$ of the sequence the amount

    $f^n = F^n(A,B)$    (1)

The sequence of the amounts $f^n$ may be called the frequency sequence counted
through. This is the most important of the frequency sequences.

Second, we can coordinate to each element of the sequence the frequency
$f^n$ according to (1), but counted only for the preceding segment of the
length n, so that the enumeration of the frequencies begins at $y_{i-n+1}$ and
ends at $y_i$. We thus count the frequency within segments overlapping each
other. The sequence of values $f^n$ thus constructed will be called a
frequency sequence counted in overlapping segments.

Third, we can divide the sequence into sections of the length n that do
not overlap; we then coordinate to each section the frequency $f^n$ according
to (1). Thus an amount $f^n$ is coordinated to each element $y_i$ for which i
is a multiple of n. The sequence of the values $f^n$ thus selected will be
called a frequency sequence counted in sections.

The frequency properties of probability sequences will be characterized by
means of certain statements about the probabilities in coordinated frequency
sequences. For instance, we may inquire into the probability that the
frequency assumes certain specified values or that it lies within a certain
specified interval. Thus we arrive at probability statements concerning
coordinated frequency sequences. It is of great import that such questions can
be answered not only in the interpreted but also in the formal calculus of
probability.
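
The three kinds of frequency sequence distinguished in this section can be made
concrete with a short sketch. It assumes that the sequence is compact and is
encoded as a list of 0s and 1s (1 standing for B); the function names are
hypothetical and chosen only for illustration.

```python
def counted_through(xs):
    """Frequency of B among the first i elements, for i = 1, 2, ..."""
    out, hits = [], 0
    for i, x in enumerate(xs, start=1):
        hits += x
        out.append(hits / i)
    return out

def overlapping_segments(xs, n):
    """Frequency within y_{i-n+1} ... y_i for every i >= n (sliding window)."""
    if n > len(xs):
        raise ValueError("segment longer than sequence")
    hits = sum(xs[:n])
    out = [hits / n]
    for i in range(n, len(xs)):
        hits += xs[i] - xs[i - n]   # shift the window by one element
        out.append(hits / n)
    return out

def sections(xs, n):
    """Frequency within consecutive, non-overlapping sections of length n."""
    return [sum(xs[j:j + n]) / n for j in range(0, len(xs) - n + 1, n)]

xs = [1, 0, 1, 1, 0, 1, 0, 0]
print(counted_through(xs))
print(overlapping_segments(xs, 4))
print(sections(xs, 4))
```

The sliding window keeps a running count, so that each shift of the segment
costs one addition and one subtraction rather than a fresh enumeration.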

It is true that in the formal calculus no certainty statements can be made 
about the structure of coordinated frequency sequences; nevertheless, we can 
make probability statements about the properties of frequency sequences by 
calculating the probability of finite sections or segments according to the 
rules for the probability of combinations. Thus this investigation answers 
the question concerning the kind of statements about frequencies that can 
be made in the formal calculus, that is, without the use of the frequency 
interpretation. It will turn out that these statements assume the character 
of statements of convergence. 

But the results are important also for the interpreted probability calculus. 
The convergence of frequency sequences, which the frequency interpretation 
asserts, is not subject to quantitative restrictions stipulating the degree of 
convergence for a given n; we obtain statements about the quality of the
convergence only when we have recourse to probability statements about the 
convergence of frequency sequences. 

Since the question of the probability of frequencies leads to problems of 
convergence, its answer concerns problems important in practical statistics; 
the degree of convergence of frequency sequences naturally plays a prominent 
part in practical applications. Another advantage of the method is that it 
leads to further classification of types of probability sequences, this time 
defined in terms of the kind of convergence. As a consequence, questions will 
arise how far the types that were developed previously coincide with the 
convergence types. Thus frequency considerations lead to a more profound 
understanding of the structure of order of probability sequences. 

§ 49. The Theorem of Bernoulli 

The principal theorem among the results of frequency considerations is the 
theorem of Bernoulli. Like all the theorems previously presented, it will be 
developed first in the formal calculus of probability; its meaning can then 
be stated easily in the interpreted calculus. In this section the derivation 
will be given only for normal sequences, so that the special theorem of
multiplication is applicable. An extension to other than normal sequences will
be explained later.

The questions raised by this theorem start with the probability of
combinations, and thus it is advisable to develop the theorem first for
frequency sequences counted in segments. Later it will be extended to
frequency sequences counted in other ways. Assume

    $P(A,B) = p$    (1)

We inquire into the probability of the occurrence of a combination of m
elements $B$ with (n - m) subsequent elements $\bar{B}$, that is, a
combination of the form

    $B^1 \ldots B^m.\bar{B}^{m+1} \ldots \bar{B}^n$    (2)

The probability of this combination is

    $P(A,B^1 \ldots B^m.\bar{B}^{m+1} \ldots \bar{B}^n) = p^m (1 - p)^{n-m}$    (3)


This combination of n elements has the property that the frequency occurring
in it is given by

    $f = \frac{m}{n}$    (4)

The notation is to be understood as follows. As in §§ 35 and 43, we distinguish
the individual value $f^i$ of the frequency from its possible amounts. For a
given n the amounts are discrete steps determined by the successive values of
m, so that we would have to write $f_m$ for the step, in correspondence to the
amount $u_m$ in § 35. However, if we regard n as a variable, that is, as
representing any chosen integer, it is advisable to regard the possible amount
$\frac{m}{n}$ as a continuous variable $f$ (without subscript), similar to the
amount $u$ in § 43. In doing so we consider the continuous background of all
rational values $\frac{m}{n}$ as representing the range of the variable $f$.

The combination (3) is not the only one having the frequency (4); the same
frequency $f$ is produced by different combinations, provided they have the
same total number of elements, and among them an equal number of elements
$B$, that is, the same n and m. How many such combinations are possible?
Their number may be designated by k; it is calculated as follows. If we
permute the terms of (2), each keeping its superscript, all the possible
permutations of the n terms represent combinations of the same frequency
$f = \frac{m}{n}$.

How many permutations of this kind exist? The answer is seen from the
following schema:

    • 1 •
    • 1 • 2 •
    • 1 • 2 • 3 •

If there is only one term, there are two empty places, designated by points,
that may be occupied by a second term. For two terms, therefore, we have 1·2
permutations. If two terms are given, three empty places are available for a
third term (second row of the schema), that is, three arrangements are
possible. Since this holds for each of the two arrangements in which the two
terms may occur, the total number of permutations of three terms obtains as
1·2·3. Continuing the inference, we arrive at the result that for n terms
there are n! permutations.

This, however, is more than the desired number k; this number is rather
obtained by the permutations of (2) resulting when the superscript does not
participate in the permutation. In the permutation of (2) with superscripts,
permutations that are distinguished only by the order of their superscripts
are counted as different, while they have the same order in respect to $B$ and
to $\bar{B}$. Obviously, the number of such arrangements is given by the
permutation of the m elements $B$ among themselves and of the (n - m) elements
$\bar{B}$ among themselves, that is, by m! and (n - m)! respectively. The
number must not be subtracted, however; we must divide by it, since each
arrangement without superscript supplies m!(n - m)! arrangements that
contribute to the permutations with superscript. Thus, between the k
arrangements without superscript and the n! arrangements with superscript, we
have the relation

    $k\,m!\,(n - m)! = n!$

that is,

    $k = \frac{n!}{m!\,(n - m)!} = \binom{n}{m}$    (5)

It is an advantage for the mathematical evaluation that k happens to be
determined by the binomial coefficients.
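
The counting argument can be imitated by brute force for small n. The sketch
below labels every term with its superscript, forms all n! permutations, and
then disregards the labels; the helper name `count_arrangements` and the choice
n = 5, m = 2 are assumptions made only for this illustration.

```python
from itertools import permutations
from math import comb, factorial

def count_arrangements(n, m):
    """Number of distinct orders of m letters B and n - m letters not-B."""
    labelled = [("B", i) for i in range(m)] + [("notB", i) for i in range(m, n)]
    # Drop the superscripts after permuting; identical orders collapse in the set.
    unlabelled = {tuple(kind for kind, _ in perm) for perm in permutations(labelled)}
    return len(unlabelled)

n, m = 5, 2
k = count_arrangements(n, m)
print(k, comb(n, m))
print(k * factorial(m) * factorial(n - m), factorial(n))
```

Each unlabelled arrangement is produced by m!(n - m)! labelled permutations,
which is why the division, and not a subtraction, yields k.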

We now wish to find the probability of obtaining any one of the k
arrangements. The question concerns the probability of a disjunction. We
introduce the abbreviation

    $F^n_m =_{Df} [B^1 \ldots B^m.\bar{B}^{m+1} \ldots \bar{B}^n] \lor \ldots \lor [\bar{B}^1 \ldots \bar{B}^{n-m}.B^{n-m+1} \ldots B^n]$    (6)

in which the terms of the disjunction consist of all the combinations that
contain m letters $B$ and (n - m) letters $\bar{B}$ in any arrangement. We
read $F^n_m$ as “frequency m for B among n elements”. To the case $F^n_m$ is
then coordinated the amount $f = \frac{m}{n}$. Since all the k terms of the
disjunction (6) have the same probability (3), we obtain with the help of (5)
the binomial distribution

    $P(A,F^n_m) = \binom{n}{m} p^m (1 - p)^{n-m} = w_{nm}$    (7)

This expression is called Newton’s formula.¹ Although it appears to have a
simple structure, its functional properties are difficult to realize; but a
graphical representation provides a suitable illustration of the formula. For
this purpose we plot, for a given n, the number m as abscissa and $w_{nm}$ as
ordinate. Figure 23 presents a diagram of this kind. I have chosen
$p = \frac{3}{4}$ and drawn the curve for n = 4 and n = 8, respectively. The
end points of the ordinates are connected by a train of lines. The maximum of
$w_{nm}$ is situated for the first case at m = 3; for the second, at m = 6.
That is, in both it corresponds to $f = \frac{3}{4} = p$. But in the second
case the maximum is smaller than in the first case, since in the first we have
$(w_{nm})_{\max} = 0.42$, and in the second we find $(w_{nm})_{\max} = 0.31$.

Fig. 23. Graphical representation of Newton’s formula (7) for
$p = \frac{3}{4}$ and two different values of n. (Abscissa: m; ordinate:
$w_{nm}$.)

¹ For a disjunction of r terms having the probabilities $p_1 \ldots p_r$
Newton’s formula assumes the extended form of the multinomial distribution:

    $w_{n_1 \ldots n_r} = \frac{n!}{n_1! \ldots n_r!}\, p_1^{n_1} \ldots p_r^{n_r}$,    $n_1 + \ldots + n_r = n$

where $n_1 \ldots n_r$ are the frequencies, respectively, of the r cases. The
proof follows according to the considerations used for (5).
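
The two maxima quoted for figure 23 can be recomputed directly from (7). The
following sketch assumes p = 3/4, the value for which the stated maxima at
m = 3 and m = 6 come out; the function name `w` merely mirrors the symbol
$w_{nm}$.

```python
from math import comb

def w(n, m, p):
    """Newton's formula (7): probability of frequency m among n elements."""
    return comb(n, m) * p**m * (1 - p)**(n - m)

p = 0.75
for n in (4, 8):
    values = [w(n, m, p) for m in range(n + 1)]
    m_max = max(range(n + 1), key=values.__getitem__)
    print(f"n = {n}: maximum at m = {m_max}, w = {values[m_max]:.2f}")
```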

The calculation of the maximum is relatively easy. Consider the quotient
of two successive values (assuming that $p \neq 0$ and $1 - p \neq 0$):

    $\frac{w_{nm}}{w_{n,m-1}} = \frac{p}{1 - p} \left( \frac{n + 1}{m} - 1 \right)$    (8)

While m varies from m = 1 to m = n (the value $w_{n0}$ in the denominator
is included as the case m = 1), the expression (8) decreases continually for
increasing m; it becomes equal to 1 for

    $m = p(n + 1)$    (9)

The maximum of $w_{nm}$ is characterized by a change-over of the quotient (8)
from values > 1 to values < 1. For the computation we must distinguish
the following cases:

1. $p(n + 1)$ is an integer > 0, but < n + 1. Then the value $m = p(n + 1)$
resulting from (9) determines a $w_{nm}$ that, because of (8), is equal to
$w_{n,m-1}$; these two values represent the maximum.

2. $p(n + 1)$ is a fraction (not an integer) > 1. Then the largest integer
m below $p(n + 1)$ gives the maximum $w_{nm}$, since this value represents the
largest integer m for which the quotient (8) is > 1. Here there is only one
maximum value $w_{nm}$.

3. $p(n + 1) < 1$. In this case the quotient (8) is not > 1 for any possible
value m. This means that the maximum lies at m = 0.

The cases p = 0 and p = 1 remain to be considered:

4. p = 1. Then the maximum is m = n, a result that follows directly
from (7).

5. p = 0. Then the maximum is m = 0, a result that also follows directly
from (7).

It is common to all the cases that the desired m is equal either to $p(n + 1)$
or to the next smaller integer; we can therefore represent the results
collectively in the double inequality for m:

    $p(n + 1) \ge m \ge p(n + 1) - 1$    (10)

That integer m or those two integers m that satisfy (10) supply the one or
the two maximum values $w_{nm}$. Further maxima cannot exist, since (8) falls
off continuously with increasing m.
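
The five cases and the double inequality (10) can be compared with a direct
search for the largest $w_{nm}$. The sketch below uses exact rational
arithmetic so that the tie of case 1 is detected without rounding error; the
function names and the sample values of p and n are assumptions chosen for
illustration.

```python
from fractions import Fraction
from math import comb

def w(n, m, p):
    """Newton's formula (7), evaluated exactly when p is a Fraction."""
    return comb(n, m) * p**m * (1 - p)**(n - m)

def maxima_by_search(n, p):
    """All m in 0..n at which w_nm attains its largest value."""
    values = [w(n, m, p) for m in range(n + 1)]
    top = max(values)
    return [m for m, v in enumerate(values) if v == top]

def maxima_by_rule(n, p):
    """All integers m in 0..n satisfying p(n+1) - 1 <= m <= p(n+1), i.e. (10)."""
    upper = p * (n + 1)
    return [m for m in range(n + 1) if upper - 1 <= m <= upper]

cases = [
    (Fraction(1, 2), 3),    # case 1: p(n+1) = 2 is an integer, two maxima
    (Fraction(3, 4), 8),    # case 2: p(n+1) = 6.75, single maximum
    (Fraction(1, 10), 4),   # case 3: p(n+1) < 1, maximum at m = 0
    (Fraction(1), 5),       # case 4: p = 1
    (Fraction(0), 5),       # case 5: p = 0
]
for p, n in cases:
    print(p, n, maxima_by_search(n, p), maxima_by_rule(n, p))
```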

For the frequency $f = \frac{m}{n}$ we obtain from (10) the limits

    $p + \frac{p}{n} \ge f \ge p - \frac{1 - p}{n}$    (11)

Since the difference between the two limits amounts to only $\frac{1}{n}$, we
can assume for large n that this frequency is given approximately by

    $f = p$    (12)

This equation represents the first result: the frequency that corresponds to 
the probability p has the highest probability. 

The probability with which f = p is to be expected can be found by the
calculation of the corresponding $w_{nm}$. For this purpose we must substitute
for m in (7) the value pn. The computation is achieved by the help of a
method of approximation, based on Stirling's formula

    $n! = n^n e^{-n} \sqrt{2\pi n}$    (13)

This formula is often used in the calculus of probability; it converges so well
that the deviation from the correct values of the factorial in the case n = 10
amounts to only 8 per 1,000, becoming much smaller for larger n. If we
replace the factorials in (7) by the expressions given in Stirling's formula
(13) and substitute np for m, we arrive easily at the result

    $(w_{nm})_{\max} = \frac{1}{\sqrt{2\pi n p(1 - p)}}$    (14)

This is the probability for the occurrence of the frequency f = p.
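
Both the quality of Stirling's formula at n = 10 and the approximation (14) can
be checked numerically. This is a minimal sketch; the helper names and the
sample values n = 100, p = 1/2 are assumptions made for the illustration.

```python
import math

def stirling(n):
    """Right-hand side of Stirling's formula (13)."""
    return n**n * math.exp(-n) * math.sqrt(2 * math.pi * n)

# Relative deviation of (13) from the exact factorial for n = 10.
rel_err = 1 - stirling(10) / math.factorial(10)
print(f"relative deviation at n = 10: {rel_err:.4f}")

def peak_exact(n, p):
    """Exact w_nm from (7) at m = np (assumes np is an integer)."""
    m = round(n * p)
    return math.comb(n, m) * p**m * (1 - p)**(n - m)

def peak_approx(n, p):
    """Approximation (14) for the same quantity."""
    return 1 / math.sqrt(2 * math.pi * n * p * (1 - p))

print(peak_exact(100, 0.5), peak_approx(100, 0.5))
```

The relative deviation comes out slightly above 0.008, in agreement with the
"8 per 1,000" mentioned in the text.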

This probability, although representing a maximum, is not very large; it
even goes toward zero for increasing n. But this is not surprising, since for
large numbers a small deviation of $f$, in percentage, presents a large
difference in m. A result in terms of percentage can be very exact, whereas
the corresponding m in its absolute value may differ by a large number from
the optimum value of m. From the standpoint of frequency, however, the
question of the precise value of $f$ is not important; it suffices to answer
the question with which probability $f$ lies within an interval from $f_1$ to
$f_2$. This modification means that we consider not only combinations of n
elements that have precisely the same value m, but further combinations
possessing somewhat different values m. We thus permit m to vary from $m_1$ to
$m_2$. Consequently, we consider a disjunction of disjunctions, in which each
disjunction occurring as a term has the form (6). We introduce the
abbreviation

    $F^n[f_1,f_2] =_{Df} F^n_{m_1} \lor F^n_{m_1+1} \lor \ldots \lor F^n_{m_2}$,    $f_1 = \frac{m_1}{n}$,    $f_2 = \frac{m_2}{n}$    (15)


which may be read: “a frequency between $f_1$ and $f_2$ for B among n
elements”. We then derive from the theorem of addition, because of (7),

    $P(A,F^n[f_1,f_2]) = \sum_{m=m_1}^{m_2} w_{nm} = b_n(f_1,f_2)$    (16)
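
The sum in (16) is easily evaluated for small n. The sketch below assumes the
values n = 8 and p = 3/4 of figure 23 and the limits $m_1 = 5$, $m_2 = 7$; the
function name `bernoulli_sum` is hypothetical.

```python
from math import comb

def w(n, m, p):
    """Newton's formula (7)."""
    return comb(n, m) * p**m * (1 - p)**(n - m)

def bernoulli_sum(n, m1, m2, p):
    """Sum of w_nm for m = m1 ... m2, i.e. the quantity of (16)."""
    return sum(w(n, m, p) for m in range(m1, m2 + 1))

# Probability of a frequency between 5/8 and 7/8 among n = 8 elements.
b = bernoulli_sum(8, 5, 7, 0.75)
print(f"{b:.4f}")
```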


Again it is advisable to use a graphical representation, which illustrates
this quantity as well as the transition from (7) to (16). However, we choose
a representation by means of a histogram, a staircase-shaped distribution,
which permits transition to a continuous distribution. The resulting diagram
is presented in figures 24A and B, which are constructed for several values
of n, the first two of which are identical with the values n of figure 23
(p. 265). The first difference from figure 23 is that in figures 24A and B the
abscissa is $f$ and not m as in figure 23. Therefore, all curves of figures
24A and B extend over the same domain of the abscissa, since for each n the
value of $f$ varies between 0 and 1. The second difference from figure 23 is
that the ordinate of figures 24A and B represents a probability density, that
is, the area of a rectangular strip of ordinates is made equal to $w_{nm}$. In
other words, we use as ordinate a height $w_n(f_m)$, which is determined by
the equation

    $w_n(f_m) \cdot \Delta f = w_{nm}$    (17)

In order to allocate the n + 1 possible values of $\frac{m}{n}$ (m varying
from 0 to n) to the abscissa, we need n + 1 intervals; thus

    $\Delta f = \frac{1}{n + 1}$    (18)

    $w_n(f_m) = (n + 1)\, w_{nm}$    (19)

These ordinates are represented in figures 24A and B. The value $\frac{m}{n}$
to which an interval belongs lies within the interval at a point that divides
the interval in the ratio of m to n. The point m = 0 lies at the left end of
the first interval; the point m = 1 lies $\frac{1}{n} \Delta f$ to the right
of the left end of the second interval; and so on to the point m = n, which
lies at the right end of the last interval. The ordinate strips supply the
histogram, which takes the place of the oblique lines of figure 23. For the
sake of clarity the steps have been drawn in outline only. All
staircase-shaped curves have their peaks at $f = p$. Figure 24A is drawn for
$p = \frac{3}{4}$; figure 24B, for $p = \frac{1}{2}$. The histograms are
symmetrical only for $p = \frac{1}{2}$; for $p = \frac{3}{4}$ the columns on
the right side are higher than those on the left. The diagram for
$p = \frac{1}{4}$ would be the mirror image of figure 24A.

Fig. 24A. Bernoulli curves for $p = \frac{3}{4}$ and different values of n,
according to (19) and (21).

Fig. 24B. Bernoulli curves for $p = \frac{1}{2}$ and different values of n,
according to (19) and (21).

The quantities $b_n(f_1,f_2)$, too, are illustrated in figures 24A and B. They
would be represented in figure 23 by a sum of neighboring ordinates; in
figures 24A and B they are given by a sum of neighboring ordinate strips. Thus
they are determined by areas delimited on both sides by the ordinates
belonging to $f_1$ and $f_2$, and on top by the staircase-shaped curve. The
farther apart the values $f_1$ and $f_2$, the greater is $b_n(f_1,f_2)$. If we
put $f_1 = 0$ and $f_2 = 1$, the total area between the curve and the abscissa
corresponds to the quantity $b_n(f_1,f_2)$, for which we write $b_n(0,1)$.
This quantity represents the probability that $f$ assumes some value between 0
and 1; the probability is equal to 1. We therefore have

    $b_n(0,1) = \sum_{m=0}^{n} w_{nm} = 1$    (20)

That this relation follows from the definition (7) of the $w_{nm}$ is shown as
follows. The expressions (7) are known from algebra as the terms of the
binomial expansion (45, § 24) for q = 1 - p. Since for this value p + q = 1,
the expression $(p + q)^n$ and therefore the sum also is = 1.
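
The histogram construction and the normalization (20) can be reproduced
together. The sketch assumes n + 1 strips of equal width 1/(n + 1), each with
area $w_{nm}$, and uses n = 8, p = 3/4 as sample values; the function name
`histogram` is chosen only for this illustration.

```python
from math import comb

def w(n, m, p):
    """Newton's formula (7)."""
    return comb(n, m) * p**m * (1 - p)**(n - m)

def histogram(n, p):
    """Strip width and strip heights of the staircase curve for given n and p."""
    width = 1 / (n + 1)
    heights = [w(n, m, p) / width for m in range(n + 1)]   # height * width = w_nm
    return width, heights

width, heights = histogram(8, 0.75)
area = sum(h * width for h in heights)
print(f"total area = {area:.12f}")
print("highest strip belongs to m =", heights.index(max(heights)))
```

Since every strip has the area $w_{nm}$, the total area is the binomial sum
$(p + q)^n$ and equals 1 whatever the value of n.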

For large n the staircase-shaped train of lines can hardly be distinguished
from a smooth curve. In this case $f$ may be regarded as a virtually
continuous variable, to which an ordinate $w_n(f)$ is coordinated by the
equation resulting from (19):

    $w_n(f) = (n + 1)\, w_{nm}$    for $f = \frac{m}{n}$    (21)

In order to extend this formula to noninteger values of m and n, an
interpolation is used for the factorials in (7), for instance, by means of
Stirling's formula (13).

In the definition of $b_n(f_1,f_2)$ we must then replace the summation by an
integration, obtaining

    $b_n(f_1,f_2) = \int_{f_1}^{f_2} w_n(f)\, df$    (22)

This equation, in which the transition to the continuous distribution is
carried through, is strictly valid only for $n \to \infty$; but it represents
a very good approximation for finite n, so that for large n it cannot be
distinguished practically from the precise value (16). The function $w_n(f)$
is a probability density. The original step nature of $f$ finds its expression
in the fact that the usual approximation $w_n(f)\,df$ supplies a precise value
if m and n are integers and $df = \frac{1}{n + 1}$. The function $w_n(f)$ will
be called Bernoulli density; the function $b_n(f_1,f_2)$ will be called
Bernoulli probability.

In the sense of (21) and (22), the train of lines is replaced by a continuous
curve for the case n = 64, in figure 24A-B. The replacement offers no
difficulties for large n, since for n = 64 the step would assume a width of
only 1 mm. in the scale of the drawing. The quantity $b_n(f_1,f_2)$ is given
by an area bounded on the sides by the ordinates of $f_1$ and $f_2$ and on top
by the curve $w_n(f)$. The peak of the smoothed-out curve corresponds, of
course, to the value $f = p$. Nevertheless, we do not speak of the probability
that $f$ is exactly equal to p, but of the probability that $f$ lies within an
interval around p, since the probability assumes a nonzero value only in
respect to an interval. Thus $b_n(f_1,f_2)$, not $w_n(f)$, is a probability.
That the probability of obtaining a frequency $f$ in the neighborhood of p
becomes larger with a transition to larger n is expressed by the fact that for
larger n the curve contracts to a narrower region around $f = p$. Its total
area remains equal to 1, since, according to (20), we have the condition of
normalization

    $\int_0^1 w_n(f)\, df = 1$    (23)

For large values of n, however, the larger part of the area is concentrated
in a narrow vertical strip around the value $f = p$. The peak value of the
curve results from (21) and (14) as

    $(w_n(f))_{\max} = \frac{n + 1}{\sqrt{2\pi n p(1 - p)}} = \frac{\sqrt{n} + \frac{1}{\sqrt{n}}}{\sqrt{2\pi p(1 - p)}}$    (24)



This value approaches $\infty$ with increasing n, whereas the area remains
finite and equal to 1. The probability $b_n(f_1,f_2)$, considered for an
interval from $f_1$ to $f_2$ not including the peak, becomes smaller with
greater n; but if the interval includes the peak, the probability
$b_n(f_1,f_2)$ becomes larger with increasing n. The point $f = p$
constitutes, therefore, the critical point of a bundle of Bernoulli curves.
This peculiarity can be characterized as follows: to arrive at an interval
including this point becomes more and more probable for increasing n; to
arrive at an interval not including the point becomes more and more improbable
for increasing n.
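
The behavior at the critical point can be observed numerically with exact
binomial sums. In the following sketch the value p = 3/4, the two intervals,
and the values of n are assumptions chosen for illustration; one interval
contains the point $f = p$, the other does not.

```python
from math import ceil, comb, floor

def w(n, m, p):
    """Newton's formula (7)."""
    return comb(n, m) * p**m * (1 - p)**(n - m)

def b(n, f1, f2, p):
    """Sum of w_nm over all m with f1 <= m/n <= f2."""
    m1, m2 = ceil(f1 * n), floor(f2 * n)
    return sum(w(n, m, p) for m in range(m1, m2 + 1))

p = 0.75
for n in (16, 64, 256):
    inside = b(n, 0.65, 0.85, p)    # interval including f = p
    outside = b(n, 0.40, 0.60, p)   # interval not including f = p
    print(f"n = {n:3d}   including p: {inside:.4f}   excluding p: {outside:.6f}")
```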

These properties are expressed in the following relations holding for the
Bernoulli probability $b_n(f_1,f_2)$:

    $\lim_{n \to \infty} b_n(f_1,f_2) = 1$    if $f_1 \le p \le f_2$
                                                                        (25)
    $\lim_{n \to \infty} b_n(f_1,f_2) = 0$    if $p < f_1$ or $p > f_2$

These relations will be proved in § 55.

There are no general analytic expressions by which the curves $w_n(f)$ can
be represented. However, it is often sufficient to restrict the consideration
to the environment of the point $f = p$ and to put $f_1 = p - \delta$,
$f_2 = p + \delta$. We then abbreviate the symbols introduced in (15) and (16)
by the definitions

    $F^n_\delta =_{Df} F^n[p - \delta,\, p + \delta]$
                                                                        (26)
    $b_{n\delta} =_{Df} b_n(p - \delta,\, p + \delta)$



For this domain Laplace has given an approximation for the integral occurring
in (22). The result² (which is strict for the limit $n = \infty$) gives
approximately the equation

    $b_{n\delta} = P(A,F^n_\delta) = \frac{2}{\sqrt{\pi}} \int_0^{\frac{\delta \sqrt{n}}{\sqrt{2p(1 - p)}}} e^{-t^2}\, dt$    (27)

We see that the Bernoulli curve, in the neighborhood of its peak, is
approximated by a Gauss exponential function. Whereas the diagram of figure
24A-B (pp. 268-269) represents the increase of the probability for a fixed
$\delta$ by contraction of the curve within given limits of the abscissa, (27)
supplies a representation in which the curve $e^{-t^2}$ keeps its form, but in
which the limits of the integral are extended. Therefore t is an auxiliary
variable, and the dependence of $b_{n\delta}$ on n finds its expression in the
fact that n enters into the limits of a definite integral. The construction of
this integral may be regarded as a mathematical device to represent a
functional relation between $b_{n\delta}$ and n and likewise between
$b_{n\delta}$ and p.

By the use of the abbreviation $b_{n\delta}$, the first of the relations (25)
can be written

    $\lim_{n \to \infty} b_{n\delta} = 1$    (28)

The formula is valid for any $\delta$, however small, but $\delta$ must remain
constant while the transition to the limit is carried out. The meaning of (28)
can be paraphrased as follows:

If a certain inexactness $\delta$ is specified, the probability that the
frequency $f = \frac{m}{n}$ of a segment of the length n will lie within
$p \pm \delta$ can be made as closely equal to 1 as desired by making n
larger. This holds for any chosen $\delta$, however small, but for small
$\delta$ a greater value of n is to be employed if we wish to have the same
probability for the desired frequency.

Some numerical examples will serve to illustrate the result. In table 4 the
heading of each vertical column specifies the value p and the exactness
$\delta^*$ to which the column belongs; the numbers given under $b_{n\delta}$
refer to the n of the respective horizontal rows. The deviation, measured in
percentage of p, is designated by $\delta^*$, that is,
$p \pm \delta = p(1 \pm \delta^*)$.

² The proof is found in most of the mathematical texts on probability, for
instance, Emanuel Czuber, Wahrscheinlichkeitsrechnung (2d ed.; Leipzig, 1908),
p. 119. The theorem can also be derived as a special case from the theorem
mentioned at the end of § 44. See Richard von Mises, Vorlesungen aus dem
Gebiete der angewandten Mathematik, Vol. I: Wahrscheinlichkeitsrechnung
(Leipzig, 1931), pp. 209 ff. The most important consequence, which is
expressed in (25), is more easily derivable (see § 55).

The table shows not only the increase of $b_{n\delta}$ toward 1 but also that
the same high value of $b_{n\delta}$ is reached at larger n for smaller
$\delta^*$. Thus for $p = \frac{1}{2}$ and $\delta^* = 20\%$ the value
$b_{n\delta} = 0.95$ obtains for n = 100, whereas for the same p, but with
$\delta^* = 10\%$, the same value of $b_{n\delta}$ obtains for about n = 500.


TABLE 4

    p      1/4     1/4     1/2     1/2     3/4     3/4
    δ*     20%     10%     20%     10%     20%     10%

    n      b_nδ    b_nδ    b_nδ    b_nδ    b_nδ    b_nδ

    10     0.29    0.14    0.47    0.24    0.72    0.42
    30     0.48    0.24    0.73    0.41    0.94    0.66
    50     0.59    0.31    0.84    0.52    0.99    0.78
    100    0.75    0.44    0.95    0.68   ~1.00    0.92
    500    0.99    0.80   ~1.00    0.97   ~1.00   ~1.00


Furthermore, for smaller p, but with the same exactness in percentage, the
same value of $b_{n\delta}$ is reached much later. Thus $p = \frac{1}{2}$,
$\delta^* = 10\%$, n = 100: $b_{n\delta} = 0.68$; $p = \frac{1}{4}$,
$\delta^* = 10\%$, n = 100: $b_{n\delta} = 0.44$. The difference results from
the fact that equal exactness in percentage, for a smaller probability, means
a smaller absolute value of $\delta$. Thus $\delta^* = 10\%$ means for
$p = \frac{1}{2}$ that $\frac{m}{n}$ lies within 0.5 ± 0.05, whereas it means
for $p = \frac{1}{4}$ that $\frac{m}{n}$ lies within 0.25 ± 0.025.
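
The entries of table 4 can be recomputed from Laplace's approximation (27),
because the integral occurring there is the error function. The sketch below
assumes that the table was obtained in this way and that $\delta = p\delta^*$;
the function name `b_laplace` is hypothetical. Under these assumptions the
entries are reproduced to within about 0.01.

```python
import math

def b_laplace(p, delta_star, n):
    """Approximation (27): erf of delta*sqrt(n)/sqrt(2p(1-p)), with delta = p*delta_star."""
    delta = p * delta_star
    return math.erf(delta * math.sqrt(n) / math.sqrt(2 * p * (1 - p)))

columns = [(0.25, 0.20), (0.25, 0.10), (0.5, 0.20), (0.5, 0.10), (0.75, 0.20), (0.75, 0.10)]
for n in (10, 30, 50, 100, 500):
    row = "  ".join(f"{b_laplace(p, d, n):.2f}" for p, d in columns)
    print(f"{n:4d}  {row}")
```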

Bernoulli was well aware of the significance of his theorem. He justly
emphasized that it cannot be regarded as a matter of course that the
probability of observing the ideal frequency converges toward 1, and that this
theorem requires “a proof based upon scientific principles”.

Some other circumstances must be taken into account, of which perhaps no one has 
even thought, so far. It remains to investigate whether with the increase of the number 
of observations there likewise increases continually the probability of obtaining the true 
ratio for the number of favorable to the number of unfavorable observations; and whether 
this happens in such a manner that this probability finally surpasses any chosen degree 
of certainty, or whether the problem, so to speak, has its asymptote, that is, whether there 
exists a definite degree of exactness . . . that can never be surpassed, however much we 
increase the number of observations. . . . This is the problem I intend to publish at this 
place, after I have carried it around with me for twenty years; its novelty as well as its 
extraordinary usefulness, connected as it is with an equally great difficulty, will make all 
other aspects of the theory gain in importance and significance.³

The bearing of the theorem upon the calculus of probability will now be
investigated in more detail.


³ Jacob Bernoulli, Ars conjectandi (Basel, 1713), Part 4, chap. iv.

§ 50. The Significance of Bernoulli's Theorem

The function of Bernoulli's theorem is to supply frequency statements even 
within the formal conception of the probability calculus. 

Since the quantity $b_{n\delta}$, according to (27, § 49), is defined in the
formal calculus of probability by

    $b_{n\delta} = P(A,F^n_\delta)$    (1)

the Bernoulli theorem as expressed in (28, § 49) leads to the following result:

Any given probability referring to a normal sequence can be transformed
into a probability of a higher kind, the numerical value of which can be
made as high as desired, whereas the first probability is interpreted as a
frequency. Thus the statement of the first kind, “The event is to be expected
with the probability $\frac{1}{2}$”, can be replaced, according to Bernoulli,
by the statement of the second kind, “Among 500 repetitions the event is to be
expected in a frequency between 225 and 275 with the probability 0.97”.
Here the number $\frac{1}{2}$ has changed from a probability to a frequency;
it determines the middle point 250 between the two limits as a function of the
total number 500, and a new probability of the value 0.97 is introduced, which
is a high probability.

In this transformation the quantities n and $\delta$ play the role of
arbitrarily assumed parameters. The occurrence of the parameters means that we
can coordinate to the statement of the first kind, not only one statement of
the second kind, but a class of such statements. Thus in the example we could
select as a statement of the second kind, “Among 100 repetitions the event
is to be expected in a frequency between 40 and 60 with the probability
0.95”. For the first-mentioned statement of the second kind, n = 500,
$\delta^* = 10\%$, have been chosen; for the second, n = 100,
$\delta^* = 20\%$. Thus the new probability differs in the two statements. But
the transformation is always of the type illustrated; a probability is
transformed into a frequency, and its place is taken over by a new
probability, the degree of which can be made as high as desired by a suitable
choice of parameters.
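
The transformation of a statement of the first kind into a statement of the
second kind can be written out as a small computation. The sketch assumes that
the new probability is taken from the approximation (27, § 49) and that
$\delta = p\delta^*$; the function name `second_kind_statement` is
hypothetical.

```python
import math

def second_kind_statement(p, n, delta_star):
    """Return (lower limit, upper limit, probability) for given p, n, delta*."""
    delta = p * delta_star
    low = round(n * (p - delta))     # lower limit for the number of occurrences
    high = round(n * (p + delta))    # upper limit for the number of occurrences
    prob = math.erf(delta * math.sqrt(n) / math.sqrt(2 * p * (1 - p)))
    return low, high, prob

for n, delta_star in ((500, 0.10), (100, 0.20)):
    low, high, prob = second_kind_statement(0.5, n, delta_star)
    print(f"Among {n} repetitions: between {low} and {high}, probability {prob:.2f}")
```

Both statements quoted in the text result from the same first probability
1/2; only the parameters n and $\delta^*$ differ.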

The procedure may be continued: to the probability statement of the 
second kind we can coordinate a probability statement of a third kind, in 
which the probability of the second kind is transformed into a frequency, 
and a new probability of the third kind occurs. The probability of the second 
kind will then be translated into a frequency within a series of segments, or 
sections. Such frequencies will presently be encountered in connection with 
further problems. 

One fact, however, should be noticed: even with the help of the Bernoulli 
theorem it is impossible to eliminate the concept of probability and replace
it by a frequency. The reason is that the frequency statement resulting from
the Bernoulli transformation, too, contains the probability concept, though
at a different logical place. Thus Bernoulli’s theorem cannot make the
frequency interpretation dispensable. On the contrary, since every statement
of the formal probability calculus admits of a frequency interpretation, the
Bernoulli theorem likewise can be given a frequency interpretation. It is the 
probability of the second kind that for this purpose is to be interpreted as a 
limit of a frequency. 

We have only to translate (1) into a frequency statement and then to 
reformulate (28, § 49). Since (1) represents a frequency in enumeration with 
overlapping, the corresponding frequency statement will be as follows: if, 
in a normal probability sequence, we consider a segment of n consecutive 
elements, which is shifted from element to element with overlapping, and 
count among the segments those whose "internal frequency" lies between 
$f_1$ and $f_2$, then the frequency of the segments is, in the limit, equal to $b_{n\delta}$. 
The longer the segment, the higher is the frequency of these segments, and 
this frequency goes toward 1 with increasing length of the segments. In this 
interpretation the theorem represents a statement about a frequency sequence 
counted in overlapping segments. 
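The overlapping count can be tried out numerically. The following is only a sketch under stated assumptions: a pseudo-random 0/1 list stands in for a normal sequence, and the values p = 1/2 and an absolute δ = 1/10 are chosen for illustration, not taken from the text. The function name `overlapping_hit_rate` is hypothetical.

```python
import random
from fractions import Fraction
from math import ceil, floor

def overlapping_hit_rate(seq, n, p, delta):
    """Share of overlapping length-n windows whose count of 1s lies within n*(p +/- delta)."""
    lo, hi = ceil(n * (p - delta)), floor(n * (p + delta))  # exact integer bounds
    count = sum(seq[:n])          # number of B in the first window
    windows = len(seq) - n + 1
    hits = 0
    for start in range(windows):
        hits += lo <= count <= hi
        if start + n < len(seq):  # shift the window by one element
            count += seq[start + n] - seq[start]
    return hits / windows

# Assumed illustration values (not from the text): p = 1/2, absolute delta = 1/10.
p, delta = Fraction(1, 2), Fraction(1, 10)
rng = random.Random(0)
seq = [int(rng.random() < float(p)) for _ in range(200_000)]

rate_30 = overlapping_hit_rate(seq, 30, p, delta)
rate_300 = overlapping_hit_rate(seq, 300, p, delta)
print(rate_30, rate_300)
```

With the longer segment the share of segments whose internal frequency lies in the interval should be visibly closer to 1, which is the content of the limit statement in this mode of enumeration.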

The Bernoulli theorem can be formulated also as a statement in enumeration 
by sections. According to (11, § 30), the special theorem of multiplication 
holds for this enumeration when we deal with normal sequences. Therefore 
we have 

$$P(A.S_{n\kappa}, F_m^n) = P(A, F_m^n) = w_{nm}$$    (2) 

and also, using (15 and 26, § 49), 

$$P(A.S_{n\kappa}, F^n[f_1,f_2]) = P(A, F^n[f_1,f_2]) = b_n(f_1,f_2)$$    (3a) 

$$P(A.S_{n\kappa}, F_\delta^n) = P(A, F_\delta^n) = b_{n\delta}$$    (3b) 

In this interpretation, formula (28, § 49) states that a limit statement for 
the frequency of sections holds in enumeration by sections. This statement 
corresponds exactly to the limit statement concerning overlapping segments. 

Finally, the Bernoulli theorem can be stated by using a lattice enumeration, 
if it is applied to normal sequences in the narrower sense. In order to 
construct this interpretation, which makes the meaning of the theorem 
particularly clear, a few remarks must be made about the symbolism to be used. 

In the definition of the abbreviation $F_m^n$ given in (6, § 49) the superscript n 
represents a phase. If we wish to indicate, apart from the phase, the running 
superscript i (an indication that is necessary for lattice enumeration, but can 
be used likewise for enumeration in overlapping segments), we must write, 
instead of $F_m^n$, the detailed symbol 

$$F_m^{i-n+1/i} =_{Df} [B^{(i-n)+1} \ldots B^{(i-n)+m} . \bar{B}^{(i-n)+m+1} \ldots \bar{B}^{(i-n)+n}] \lor \ldots$$ 
$$\ldots \lor [\bar{B}^{(i-n)+1} \ldots \bar{B}^{(i-n)+(n-m)} . B^{(i-n)+(n-m)+1} \ldots B^{(i-n)+n}]$$    (4) 

in which the terms of the disjunction consist of all the combinations that 
contain m times $B$ and (n − m) times $\bar{B}$ in any arrangement. In the detailed 
notation the superscript i refers not to the class B but to the element $y_i$; 
therefore $F_m^{i-n+1/i}$ represents the frequency m for B within that segment 
of n elements that ends with the element $y_i$, that is, the frequency m for B 
from $y_{i-n+1}$ to $y_i$. If the running superscript is indicated outside the 
parentheses, the probability referring to this frequency will be written 

$$P(A, F_m^{i-n+1/i})^i = w_{nm}$$    (5) 

Similarly, we define, in analogy to (15, § 49), 

$$F^{i-n+1/i}[f_1,f_2] =_{Df} F_{m_1}^{i-n+1/i} \lor \ldots \lor F_{m_2}^{i-n+1/i} \qquad f_1 = \frac{m_1}{n} \quad f_2 = \frac{m_2}{n}$$ 
$$F_\delta^{i-n+1/i} =_{Df} F^{i-n+1/i}[p-\delta, p+\delta]$$    (6) 

and write the corresponding probabilities 

$$P(A, F^{i-n+1/i}[f_1,f_2])^i = b_n(f_1,f_2)$$    (7a) 

$$P(A, F_\delta^{i-n+1/i})^i = b_{n\delta}$$    (7b) 

The symbols $F_m^n$ and $F_\delta^n$ may be regarded as abbreviations of the symbols 
$F_m^{i-n+1/i}$ and $F_\delta^{i-n+1/i}$. 

In lattice enumeration the superscript k is added. We have for each 
horizontal sequence k, analogous to (5) and (7a-b), 

$$P(A, F_m^{k,\,i-n+1/i})^i = w_{nm}$$    (8a) 

$$P(A, F^{k,\,i-n+1/i}[f_1,f_2])^i = b_n(f_1,f_2)$$    (8b) 

$$P(A, F_\delta^{k,\,i-n+1/i})^i = b_{n\delta}$$    (8c) 

Since normal sequences in the narrower sense are lattice-invariant, according 
to (7, § 34), we derive 

$$P(A, F_m^{k,\,i-n+1/i})^k = w_{nm}$$    (9a) 

$$P(A, F^{k,\,i-n+1/i}[f_1,f_2])^k = b_n(f_1,f_2)$$    (9b) 

$$P(A, F_\delta^{k,\,i-n+1/i})^k = b_{n\delta}$$    (9c) 

The interpretation of these expressions becomes particularly clear when we 
make i = n, that is, when we consider the frequency of the first n elements 
of the horizontal sequences. We introduce the abbreviations 

$$F_m^{kn} =_{Df} F_m^{k,\,1/n}$$    (10a) 

$$F^{kn}[f_1,f_2] =_{Df} F^{k,\,1/n}[f_1,f_2]$$    (10b) 

$$F_\delta^{kn} =_{Df} F_\delta^{k,\,1/n}$$    (10c) 

which are permissible because we no longer need to make the distinction 
between the phase n and the running superscript i. With i = n the relations 
(9) then assume the form 

$$P(A, F_m^{kn})^k = w_{nm}$$    (11a) 

$$P(A, F^{kn}[f_1,f_2])^k = b_n(f_1,f_2)$$    (11b) 

$$P(A, F_\delta^{kn})^k = b_{n\delta}$$    (11c) 

$F_\delta^{kn}$ represents the case that the frequency of the k-th horizontal sequence 
up to the element $y_{kn}$ lies within $p \pm \delta$, that is, we have simply 

$$y_{kn} \in F_\delta^{kn} \equiv [F^n(A,B^{ki})^i = p \pm \delta]$$    (12) 

where the symbol $F^n(A,B^{ki})^i$ has the meaning introduced in (3, § 10), and the 
equality sign has the meaning "within" (see 3, § 89). 

Because of these simplifications the meaning of (11c) can be stated as 
follows. We construct for each horizontal sequence the frequency sequence 
counted through;¹ to each element $y_{kn}$ of the lattice is then coordinated, 
as amount, the frequency $f^{kn}$, so that we have 

$$f^{kn} = F^n(A,B^{ki})^i$$    (13) 

The lattice $f^{kn}$ will be called the coordinated frequency lattice. The relation 
(11c) may then be expressed in the following way. We classify (see § 41) the 
lattice $f^{kn}$ in such a manner that we take as case $F_\delta$ all the $f^{kn}$ for which 
$f^{kn} = p \pm \delta$ holds, whereas $\bar{F}_\delta$ represents the opposite case; thus an 
alternative lattice for $F_\delta$ results in which the probability $b_{n\delta}$ holds for the 
vertical sequences, according to (11c). 

1 Notice that the frequency sequences counted through are not normal sequences; they 
rather possess a probability drag, since the frequency can change only by steps from element 
to element, that is, the value $\frac{m}{n}$ can be followed only by $\frac{m}{n+1}$ or by $\frac{m+1}{n+1}$. 

The probability of $F_\delta$ for the horizontal sequences is determined when we 
assume the frequency interpretation. Since the horizontal sequences $y_{ki}$ 
approach for B the limit p of the frequency, it follows that from an element 
$f^{kr}$ onward all further $f^{kn}$ (n > r) of the frequency lattice must belong to $F_\delta$, 
so that we have 

$$P(A, F_\delta^{kn})^n = 1$$    (14) 


Since the frequency of $F_\delta$ in the vertical sequences is given by the Bernoulli 
probability $b_{n\delta}$, according to (11c), and the $b_{n\delta}$, according to (28, § 49), 
approach with increasing n the limit 1, the $f^{kn}$ classified with respect to $F_\delta$ 
form a convergent lattice. The Bernoulli theorem, applied to normal sequences 
in the narrower sense, states that the coordinated frequency lattice, classified with 
respect to $F_\delta$, is a convergent lattice of the probability 1. Bernoulli's theorem 
thus assumes the form of a statement about frequency sequences counted 
through and arranged in a lattice. The classified frequency lattice has a 
structure of the following kind: 

[Diagram (15): horizontal rows of lattice entries, each entry being $F_\delta$ or $\bar{F}_\delta$, 
every row tending in the horizontal direction to the limit 1; arrows lead down 
from the columns to the vertical frequencies 
$b_{1\delta},\ b_{2\delta},\ b_{3\delta},\ \ldots,\ b_{n\delta},\ b_{n+1,\delta},\ b_{n+2,\delta},\ \ldots \to 1$.]    (15) 

This representation by the frequency lattice is restricted, however, to the 
interpreted calculus of probability, since the frequency statement about $F_\delta$, 
that is, the relation (14), is not derivable from the Bernoulli theorem, but 
can be inferred only by means of the frequency interpretation. 
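The coordinated frequency lattice can be imitated on a small scale. This is a sketch only: a finite pseudo-random lattice stands in for the infinite lattice of normal sequences, the values p = 1/2 and an absolute δ = 1/10 are assumed for illustration, and `vertical_frequencies` is a hypothetical helper name.

```python
import random
from fractions import Fraction
from math import ceil, floor

def vertical_frequencies(rows, columns, p, delta, seed=0):
    """Relative frequency of the case F_delta in selected columns of a simulated lattice."""
    rng = random.Random(seed)
    bounds = {n: (ceil(n * (p - delta)), floor(n * (p + delta))) for n in columns}
    hits = dict.fromkeys(columns, 0)
    p_float, last = float(p), max(columns)
    for _ in range(rows):                 # one horizontal sequence per row
        m = 0                             # running count of B in this row
        for n in range(1, last + 1):
            m += rng.random() < p_float
            if n in bounds:               # does f^{kn} = m/n belong to F_delta?
                lo, hi = bounds[n]
                hits[n] += lo <= m <= hi
    return {n: hits[n] / rows for n in columns}

# Assumed illustration values: 4000 rows, columns 30 and 300, p = 1/2, delta = 1/10.
freq = vertical_frequencies(rows=4000, columns=(30, 300),
                            p=Fraction(1, 2), delta=Fraction(1, 10))
print(freq)
```

Counted vertically, the share of rows classified as $F_\delta$ should lie near the Bernoulli probability of the column concerned and should move toward 1 in the later column.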

These considerations may be illustrated by an example. Assume 

$$P(A,B) = p = \tfrac{1}{2} \qquad n = 30 \qquad \delta^{*} = 20\%$$ 

We obtain from table 4 (p. 273) the value $b_{n\delta}$ = 0.73. This means, according 
to (4) and (8), that for $\delta^{*}$ = 20% the 30th vertical column of the coordinated 
frequency lattice supplies the value 0.73 as the limit of the frequency. 
In other words, of all the horizontal sequences, 73% will have attained up 
to the 30th element a frequency within the interval 1/2 ± 20%, that is, will 
possess elements of the kind B in a number between 12 and 18. The example 
illustrates the transformation of a probability into a frequency. The statement 
$a_1$ of the first kind, "B may be expected with the probability 1/2", is 
transformed into the statement $a_2$ of the second kind, "We may expect with 
the probability 0.73 that after 30 elements the frequency will lie within the 
interval 1/2 ± 20%". The result is derivable even in the formal calculus. 
When we use the frequency interpretation, the new probability 0.73 is also 
transformed into a frequency; the result is stated in the assertion that in 
the lattice 73% of all horizontal sequences possess the property explained. 
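The figure of the example can be checked numerically. Table 4 is not reproduced in this section, so it is an assumption here that its entries rest on the Gaussian approximation of the binomial distribution; under that assumption the sketch below reproduces 0.73, whereas the exact binomial sum over 12 to 18 elements B (both limits included) comes out at roughly 0.80. The function names are hypothetical.

```python
from fractions import Fraction
from math import ceil, comb, erf, floor, sqrt

def bernoulli_exact(n, p, rel_delta):
    """Exact binomial probability that the count of B lies within n*p*(1 +/- rel_delta)."""
    lo = ceil(n * p * (1 - rel_delta))
    hi = floor(n * p * (1 + rel_delta))
    q = 1 - p
    total = sum(comb(n, m) * p**m * q**(n - m) for m in range(lo, hi + 1))
    return lo, hi, float(total)

def bernoulli_normal(n, p, rel_delta):
    """Gaussian approximation without continuity correction (assumed basis of table 4)."""
    half_width = float(n * p * rel_delta)
    sigma = sqrt(float(n * p * (1 - p)))
    return erf(half_width / sigma / sqrt(2))

# Values of the example: p = 1/2, n = 30, relative interval of 20%.
n, p, rel = 30, Fraction(1, 2), Fraction(1, 5)
lo, hi, exact = bernoulli_exact(n, p, rel)
approx = bernoulli_normal(n, p, rel)
print(lo, hi, exact, approx)
```

Exact fractions are used for p and the interval so that the integer limits 12 and 18 do not depend on floating-point rounding.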

The difference between the two conceptions of the Bernoulli theorem, the 
formal and the interpreted conceptions, is expressed in the following schema 
(P = probability): 


TABLE 5 

The Two Conceptions of Bernoulli's Theorem 

P-statement   Formal conception                        Interpreted conception 

1st kind      P 1st kind: formal meaning               P 1st kind: frequency interpretation 
                                                       in an infinite sequence 

2d kind       P 1st kind: frequency interpretation     P 1st kind: frequency interpretation 
              in a finite section of a sequence        in a finite section of a sequence 
              P 2d kind: formal meaning                P 2d kind: frequency interpretation 
                                                       in an infinite sequence lattice 

3d kind       P 1st kind: frequency interpretation     P 1st kind: frequency interpretation 
              in a finite section of a sequence        in a finite section of a sequence 
              P 2d kind: frequency interpretation      P 2d kind: frequency interpretation 
              in a finite section of a lattice         in a finite section of a lattice 
              P 3d kind: formal meaning                P 3d kind: frequency interpretation 
                                                       in an infinite three-dimensional 
                                                       sequence lattice 


The frequency interpretation assumed for the probability in the interpreted 
conception is introduced in the formal conception only in the probability 
statement of the next higher kind. The frequency occurring in the formal 
conception, however, is restricted to a finite section of a sequence or of a 
lattice. The interpreted conception, too, includes frequencies of finite 
sections, but the frequency of the highest step refers to infinite sequences or to 
lattices infinite in one direction. 

The results show that an inference from the formal to the interpreted 
conception can be made only if an assumption about the meaning of the 
probability 1 is introduced. This result will be used in § 65. 

In this connection I may point out a fallacy easily committed with respect to Bernoulli's 
theorem. The supposition offers itself that the property of homogeneity of the lattice, which 
distinguishes normal sequences in the narrower sense from those in the wider sense, could 
be derived from the properties of the latter sequences with the help of Bernoulli's theorem 
if the horizontal sequences are assumed to be independent by combinations [see (6, § 34)]. 
On this assumption, the probability of a combination of k elements occurring in a vertical 
column is equal to 

$$P(A, F_\delta^k)^i = b_{k\delta}$$    (16) 

$F_\delta^k$ stands for the frequency of elements B counted vertically. The $b_{k\delta}$ are identical with the 
Bernoulli values $b_{n\delta}$, and thus we have 

$$\lim_{k \to \infty} b_{k\delta} = 1$$    (17) 

The result might be interpreted as meaning that a limit p of the frequency of B in the 
vertical sequences is to be expected with the probability 1. But although formulas (16) and 
(17) are correct, this interpretation of them would be incorrect. Its falsehood may be 
demonstrated by constructing an opposite case: the lattice $y_{ki}$ may be occupied by elements 
$y_{ki} \in B$ in such a way that on the left side of the lattice diagonal line, given by 
$y_{11}, y_{22}, y_{33}, \ldots$, only elements $y_{ki} \in B$ occur, whereas each horizontal sequence is 
continued on the right side of the lattice diagonal line so as to form a normal sequence, in 
the wider sense, with the probability p. In such a lattice it is possible to demand also the 
independence by combinations of the horizontal sequences, so that the relations (16) and 
(17) are satisfied. Nevertheless, all vertical sequences possess the frequency limit 1 for B. 
The lattice is not homogeneous, and the quantity $\lim_{k \to \infty} b_{k\delta}$ does not represent the 
probability that the frequency of a vertical sequence lies within $p \pm \delta$. 

Equations (16) and (17) state only the following property: if we consider k horizontal 
sequences that are completely independent of each other, and count the frequency within 
the vertical columns, then, if we count the columns in the horizontal direction, those vertical 
columns will prevail in which the internal (vertical) frequency lies within $F_\delta$. If k is 
increased, the percentage of vertical columns becomes still greater, if only we count far enough 
in the horizontal direction. (The latter addition is required, because, if the length of the 
horizontal sequence remained unchanged, it could happen that an increase in k would cause 
a diminution in the percentage mentioned.) But it is not possible to make an inference 
concerning the limit of frequency in the vertical columns. 

A corresponding statement must be made in respect to the process of mixing [see (9, 
§ 34)]. A lattice assumption can be dispensed with if we restrict ourselves to a finite number 
k of horizontal sequences and make probability statements with respect to the horizontal 
direction only. Then the following conclusions can be drawn with the help of Bernoulli's 
theorem and the frequency interpretation: if we proceed sufficiently far in the horizontal 
direction, the percentage of well-shuffled vertical columns approaches a limit, which is 
determined by k and lies the closer to 1 the larger k is. But if we keep the length of the 
horizontal sequence unchanged, then an increase in k may effect a diminution of this 
percentage. 

The objection might be raised that in practice we always deal with a finite number k of 
sequences; but the answer is that also the length n of the sequences in the horizontal 
direction is always finite. Idealizing the problem by making a transition to the limit $n \to \infty$ 
means that we wish to make predictions about events that would occur for an increase of a 
finite n. Correspondingly, we want to make predictions about events that occur when k is 
increased; in the idealized problem, also, we must therefore make a transition to the limit 
$k \to \infty$. But then it is necessary to assume lattice invariance for the process of mixing. 
Applied to a finite number k, the assumption signifies a statement about events that occur 
when k is increased; the assumption that the sequences are completely independent is 
insufficient for such statements. It is for this reason that only the lattices introduced in this 
work, which are infinite in both directions, provide a complete idealization of such problems. 

§ 51. The Amplified Bernoulli Theorem 

The probability $b_{n\delta}$ of Bernoulli's theorem refers to a combination of n 
elements for which the frequency counted in the combination lies within the 
interval $p \pm \delta$. But when such a frequency is reached at the n-th element, 
we must not infer that the frequency will remain within the interval $p \pm \delta$ 
when the sequence is continued further; and, in fact, for normal sequences 
we can never derive a statement that from a specified n the frequency will 
certainly remain within $p \pm \delta$. The impossibility of such a statement is 
obvious when we realize that there may occur at any time disturbing runs 
of elements B that are long enough to push the frequency outside the interval 
$p \pm \delta$. However, we can inquire at least after the probability that the 
frequency will remain in the interval $p \pm \delta$ after it has reached this interval at 
the n-th element. 

The question is important for the frequency interpretation. To say that 
the frequency $f^n$ goes toward a limit p means that for every $\delta$, however small, 
there exists an element $y_n$ such that $f^n$ lies within $p \pm \delta$ and remains inside 
this interval for all the following elements. Such an element is called a place of 
convergence for $\delta$. (I do not wish to incorporate into the concept the condition 
that the element $y_n$ is the first element of this kind; thus there exist, if at all, 
infinitely many places of convergence for $\delta$, because, if $y_n$ is such a place, all 
the following elements have the same property.) The question to be studied, 
therefore, concerns the probability that an element $y_n$ is a place of convergence. 
We shall inquire later (§ 65) what is gained by these considerations for the 
frequency interpretation. For the present we shall deal only with the 
mathematical question of the probability concerned. 

We begin with the construction of a simpler probability. Assume the 
counting of the frequency is continued, after the element $y_n$, up to the 
element $y_s$ (s > n). We ask for the probability that each of the sections 
$y_1 \ldots y_n$, $y_1 \ldots y_{n+1}$, $\ldots$, $y_1 \ldots y_s$ has a frequency within $p \pm \delta$. The 
problem will be formulated in lattice enumeration, since this way of counting 
offers the best illustration of a probability of sequences. We introduce the 
abbreviation 

$$F_\delta^{k,\,n/s} =_{Df} F_\delta^{kn} . F_\delta^{k,n+1} \ldots F_\delta^{ks}$$    (1) 

The probability that the frequency remains within the interval $p \pm \delta$ of 
convergence for the section from n to s is then given by 

$$P(A, F_\delta^{k,\,n/s})^k = c_{ns\delta}$$    (2) 

The computation of $c_{ns\delta}$ cannot be achieved by the multiplication of the 
quantities $b_{n\delta}, b_{n+1,\delta}, \ldots, b_{s\delta}$, since this would result in too small a value, the terms 
on the right side of (1) not representing independent events. When the 
frequency $f^n$ has arrived in the interval $p \pm \delta$, there exists a greater probability 
than $b_{n\delta}$ that it will remain in the interval. The computation of $c_{ns\delta}$ is found in 
the appendix to the German edition of this book (§ 83); only a short summary 
of the results will be given here. 

From (2) we construct the desired probability by assuming s to go to 
infinite values. This probability can be written in the form 

$$P(A, F_\delta^{k,\,n/\infty})^k = \lim_{s \to \infty} c_{ns\delta} = c_{n\delta}$$    (3) 

The abbreviation in the second term, which stands for an infinite conjunction, 
is to be defined by means of an all-operator, according to a remark made 
in § 6: 

$$F_\delta^{k,\,n/\infty} =_{Df} (s)\, F_\delta^{k,\,n/s}$$    (4) 

The relation (3) is based on the assumption of the commutativity of the limit 
operation and the probability operator [see (6, § 41)], since (3) states that the 
limit of the classes $F_\delta^{k,\,n/s}$ for $s \to \infty$ has a probability that is the limit of the 
corresponding probabilities. This proof, therefore, holds only for lattices that 
satisfy this condition.¹ The restriction, however, is not serious, since the 
theorem holds only for lattices of a special kind, the proof being valid only 
for normal sequences. Lattices of the kind required play an important part 
in the theory of probability. 

The computation leads to the result that the limit $c_{n\delta}$ is not = 0 but that, 
although we have, of course, 

$$c_{n\delta} < b_{n\delta}$$    (5) 

the relation 

$$\lim_{n \to \infty} c_{n\delta} = 1$$    (6) 

holds. This result means: the probability $c_{n\delta}$, that the frequency at the n-th 
element lies within the interval $p \pm \delta$ and remains within this interval for the 
whole infinite remainder of the sequence, has a finite value and converges to 1 
when n goes to infinity. The theorem holds for every $\delta$, however small; but $\delta$ 
must be kept constant when the transition to the limit is made. The result 
can be stated also in the form: the probability that the n-th element is a 
place of convergence for a given $\delta$ goes toward 1 with increasing n. 
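The finite quantity $c_{ns\delta}$ can be computed exactly for independent trials with constant probability p by following the distribution of the count of B from element to element and discarding every path that leaves the interval. The sketch below does this; it is not the computation of the German appendix, which is not reproduced here. The values p = 1/2, absolute δ = 1/10, n = 30 and the function names are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb

def inside(m, j, p, delta):
    """True if the frequency m/j lies within p +/- delta (limits included)."""
    return abs(Fraction(m, j) - p) <= delta

def bernoulli_b(n, p, delta):
    """b_{n delta}: probability that the frequency of the first n elements lies in the interval."""
    q = 1 - p
    return sum(comb(n, m) * p**m * q**(n - m)
               for m in range(n + 1) if inside(m, n, p, delta))

def staying_probability(n, s, p, delta):
    """c_{ns delta}: probability that the frequency lies in the interval at every element n..s."""
    q = 1 - p
    dist = {m: comb(n, m) * p**m * q**(n - m)
            for m in range(n + 1) if inside(m, n, p, delta)}
    for j in range(n + 1, s + 1):
        nxt = {}
        for m, weight in dist.items():
            for m2, step in ((m, q), (m + 1, p)):   # next element is non-B or B
                if inside(m2, j, p, delta):         # keep only paths still in the interval
                    nxt[m2] = nxt.get(m2, 0) + weight * step
        dist = nxt
    return sum(dist.values())

# Assumed illustration values (not from the text).
p, delta, n = Fraction(1, 2), Fraction(1, 10), 30
b_n = bernoulli_b(n, p, delta)
c_100 = staying_probability(n, 100, p, delta)
c_200 = staying_probability(n, 200, p, delta)
product = 1
for j in range(n, 201):
    product *= bernoulli_b(j, p, delta)
print(float(b_n), float(c_100), float(c_200), float(product))
```

The printed values allow a comparison of $c_{ns\delta}$ with $b_{n\delta}$ and with the product of the $b_{j\delta}$, which treats the successive events as if they were independent and therefore comes out far too small.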

This represents an amplification of the Bernoulli theorem, which was 
proved only recently by György Pólya² after the question had been raised 
by Paul Hertz. According to the theorem, it is permitted to expect with a 
probability converging toward 1, not only that the desired exactness is 
reached at the n-th element, but also that it is adhered to thereafter. 

1 A lattice that does not satisfy the condition can be constructed as follows: we draw the 
lattice diagonal line passing through the elements $y_{11}, y_{22}, y_{33}, \ldots$. Now assume that none 
of the horizontal sequences has a limit of the frequency, but that all the horizontal sections 
to the left of the diagonal line satisfy the relation (2) for every n and s, the probability 
$c_{ns\delta}$ being counted vertically. Then the probability on the left side of (3) is = 0, and the 
relation (3) is not true. 

2 Nachr. d. Ges. d. Wiss. z. Göttingen, math.-phys. Kl., 1921. 

The theorem can be formulated for individual sequences also, both for 
enumeration in overlapping segments and for enumeration in sections. In 
the first mode of enumeration a segment of n elements with overlapping is 
moved through the entire sequence; the frequency is counted from the first 
element of the segment so that the starting point of the counting is moved 
along with the segment, and the counting is continued beyond the end of the 
segment through the whole sequence. We regard as positive the segments for 
which, first, the frequency $f^n$ at the end of the segment lies within $p \pm \delta$, 
and second, the frequency at all later places remains within $p \pm \delta$ for the 
whole infinite remainder of the sequence. The other segments are taken as 
negative. Even for larger n there will usually exist again and again negative 
segments; the differences in enumeration result from the circumstance that 
the frequency $f^n$ is always counted from the beginning of the respective 
segment and not from the beginning of the sequence. However, the frequency 
of the segments is counted from the beginning of the sequence. Then the 
frequency of the positive segments, counted for the whole sequence, is = $c_{n\delta}$. 
The greater the length of the segments, the larger the number of positive 
segments. 

For enumeration in sections the sequence is divided into consecutive 
sections of the length n, which do not overlap; the frequency $f^n$ is counted, as 
before, from the beginning of the respective section and, passing the end of 
the section, through the infinite remainder of the sequence. The continuations 
of the enumeration then overlap one another. The counting of the sections is 
handled as in the preceding case. 

§ 52. The Frequency Dispersion 

The Bernoulli theorem leads to statements about the convergence of 
frequency sequences by determining probabilities of a higher kind. It is possible 
to characterize the convergence in a different, though logically equivalent, 
manner, which is based on the concept of dispersion. The concept was 
developed in § 37 in a general way, applicable to any amounts $u_k$; we can use it 
for the convergence of frequency sequences when we treat, in particular, 
frequencies as such amounts. Therefore we speak, more precisely, of frequency 
dispersion; for the sake of brevity the term dispersion will be used in the 
same sense. The concept of dispersion is of particular advantage to practical 
calculations of a statistical kind. 

We begin the investigations with the interpreted conception, by developing 
the concept of dispersion with the help of the frequency interpretation. 
Starting with lattice enumeration, we construct the frequency lattice according 
to (13, § 50): 

$$f^{kn} = F^n(A,B^{ki})^i$$    (1) 

The value $f^{kn}$ represents the frequency of the initial section of the length n 
of the k-th horizontal sequence. This time, however, we do not classify in 
respect to $F_\delta$, but carry out the calculation with the amounts $f^{kn}$ themselves, 
as in § 39 with the amounts $u^{ki}$. The lattice has the form 

$$\begin{array}{ccccc}
f^{11} & f^{12} & f^{13} & \ldots & f^{1n} \\
f^{21} & f^{22} & f^{23} & \ldots & f^{2n} \\
\ldots & \ldots & \ldots & \ldots & \ldots \\
f^{k1} & f^{k2} & f^{k3} & \ldots & f^{kn}
\end{array}$$    (2) 

Apart from the individual values $f^{kn}$ of the amounts, steps of amounts must 
also be introduced. Since we want to consider the frequencies in the n-th 
vertical column, the steps of amounts are given by the possible values of the 
frequency for n elements. Since the possible amounts are given by the 
number m of the element B, we designate the step of amount by $f_m^n$ and have 

$$f_m^n = \frac{m}{n}$$    (3) 

The amount $f_m^n$ is coordinated to the case $F_m^{kn}$, according to (10, § 50); the 
probability of this case is 

$$P(A, F_m^{kn})^k$$    (4) 

We can now deal with the lattice $f^{kn}$ as we did previously with the lattice $u^{ki}$. 

First, however, a different conception of the lattice should be mentioned. 
We can imagine the lattice as resulting from a single sequence that is treated 
in enumeration by sections. If we cut off sections of the length n within a 
sequence, writing these sections under one another, then a lattice is produced 
that is infinite in vertical direction and extends horizontally to the n-th 
column. The $f^{kn}$ represent the internal frequency of the k-th section of the 
length n; $F_m^n$ has the meaning given in (6, § 49), and the corresponding 
probability is to be written as 

$$P(A.S_{n\kappa}, F_m^n)$$    (5) 

where $S_{n\kappa}$ is a regular division of the length n and $\kappa = n$ [see (2, § 30)]. Again 
$f_m^n$ is the step of amount coordinated to $F_m^n$. All the following considerations 
can therefore be regarded as theorems about one sequence enumerated by 
sections. The disadvantage of such a conception is that the transition to 
larger n requires a new construction of the lattice. It is even possible to 
interpret these considerations as referring to enumeration with overlapping, but 
such an interpretation is not advisable. The $f^{kn}$ would not be mutually 
independent in enumeration with overlapping for k as running superscript; and 
we would arrive at an unnecessary increase in segments, by which the accuracy 
of the consideration would not be improved. 

Turning now to the treatment of the $f^{kn}$ lattice, we can derive from the 
frequency interpretation the relation 

$$M(f^{kn})^n = p$$    (6) 

For the derivation we use the symbol $F_\delta^{kn}$, according to (10, § 50). Although 
the symbol refers, not to a fixed amount of $f^{kn}$, but to the interval $p \pm \delta$, 
we can proceed as follows. According to (14, § 50), $P(A, F_\delta^{kn})^n = 1$, that is, 
$P(A, \bar{F}_\delta^{kn})^n = 0$; thus the sum occurring in the average, according to (12, § 35), 
is reduced to a single term. Employing for the step of the amount first the 
lower limit $p - \delta$, and then the upper limit $p + \delta$, we obtain the inequalities 

$$P(A, F_\delta^{kn})^n \cdot (p - \delta) \le M(f^{kn})^n \le P(A, F_\delta^{kn})^n \cdot (p + \delta)$$    (7) 

Since this probability is equal to 1, according to (14, § 50), and (7) is to be 
valid for any $\delta$, however small, (6) follows.¹ 

The average in the vertical direction will now be considered. We can use 
either of the definitions given in § 35: the statistical definition gives the result 

$$M(f^{kn})^k = \lim_{s \to \infty} \frac{1}{s} \sum_{k=1}^{s} f^{kn} = M(f^n)$$    (8a) 

whereas the theoretical definition assumes the form 

$$M(f_m^n)_m = \sum_{m=0}^{n} P(A, F_m^{kn})^k \cdot f_m^n = M(f^n)$$    (8b) 

The two definitions are identical. For enumeration by sections the 
probability (5) is to be introduced in the expression (8b). 

1 Another form of the proof of (6) obtains if, instead of the classification of the 
$f^{kn}$-sequence with respect to $F_\delta$, and, in place of the theoretical definition of the average (12, 
§ 35), we employ the statistical definition of the average (15, § 43). Then we can make use 
of a familiar theorem of the theory of convergent series: from $\lim_{n \to \infty} a_n = a$ we can infer 
that $\lim_{n \to \infty} \left( \frac{1}{n} \sum_{i=1}^{n} a_i \right) = a$. We have $a_n = f^{kn}$, $a = p$. The relation (6) can be derived only from 
the frequency interpretation, which is no longer required for the following considerations; 
so (11) and (15) hold likewise in the formal calculus of probability. 

The value of (8) cannot, of course, be calculated without further 
assumptions in regard to the structure of the sequences. We want to investigate, in 
particular, normal sequences, and for such sequences (8) can be evaluated. 
We have for normal sequences, according to (11a, § 50) and (7, § 49), 

$$P(A, F_m^{kn})^k = w_{nm} = \binom{n}{m} p^m (1 - p)^{n-m}$$    (9) 

The same value holds for the probability (5) when we employ enumeration 
by sections. Introducing this value into (8), we can carry out the summation 
when we make use of the relations 

$$\binom{n}{m} = \frac{n}{m} \binom{n-1}{m-1}$$    (10a) 

$$\binom{n}{m} = \frac{n(n-1)}{m(m-1)} \binom{n-2}{m-2}$$    (10b) 

We first use only (10a) and have 

$$M(f_m^n)_m = \sum_{m=0}^{n} \frac{m}{n} \binom{n}{m} p^m (1-p)^{n-m} = p \cdot \sum_{m-1=0}^{n-1} \binom{n-1}{m-1} p^{m-1} (1-p)^{(n-1)-(m-1)}$$ 

The transition to the index of summation m − 1 means that in the sum the 
term with m = 0 is dropped; but the term vanishes, since it contains 0 as a 
factor. The sum now has the form $\sum_{m-1=0}^{n-1} w_{n-1,m-1}$ and is therefore equal to 1, 
according to (20, § 49). Thus we have 

$$M(f^{kn})^k = p$$    (11) 

in analogy to (6). The $f^{kn}$ lattice of normal sequences is thus homogeneous in 
respect to the average, since the average is the same for all the horizontal 
and vertical sequences. 

We can also calculate the dispersion, always assuming that we are dealing 
with normal sequences in the narrower sense. For the dispersion we employ 
the deviations 

$$\delta f^{kn} = f^{kn} - p$$    (12a) 

$$\delta f_m^n = f_m^n - p = \frac{m}{n} - p$$    (12b) 

Like the average, the dispersion can be defined in two different ways. The 
statistical definition, according to (8, § 37), is given by 

$$\Delta^2(f^{kn})^k = \lim_{s \to \infty} \frac{1}{s} \sum_{k=1}^{s} \delta^2 f^{kn} = \Delta^2(f^n)$$    (13a) 

The theoretical definition, according to (6, § 37), supplies the equation 

$$\Delta^2(f_m^n)_m = M(\delta^2 f_m^n)_m = \sum_{m=0}^{n} P(A, F_m^{kn})^k \cdot \delta^2 f_m^n = \Delta^2(f^n)$$    (13b) 

The two expressions are identical, as was demonstrated above. As before, 
the theoretical definition (13b) admits of a transformation. With (9), (11), 
(12b), and (13, § 27) we conclude: 

$$\Delta^2(f^n) = M((f^n)^2) - M^2(f^n) = \sum_{m=0}^{n} \left(\frac{m}{n}\right)^2 \binom{n}{m} p^m (1-p)^{n-m} - p^2 = \frac{A}{n^2} - p^2$$    (14) 

$$A = \sum_{m=0}^{n} m^2 \binom{n}{m} p^m (1-p)^{n-m}$$ 

Using the factorization $m^2 = m(m - 1) + m$, we obtain 

$$A = \sum_{m=0}^{n} m(m-1) \binom{n}{m} p^m (1-p)^{n-m} + \sum_{m=0}^{n} m \binom{n}{m} p^m (1-p)^{n-m}$$ 

Applying (10b) to the first expression, (10a) to the second, and making an 
inference similar to the one above, we find the following result, when we 
make use of the fact that the sums are equal to 1: 

$$A = n(n-1)\, p^2 \cdot \sum_{m-2=0}^{n-2} \binom{n-2}{m-2} p^{m-2} (1-p)^{(n-2)-(m-2)} + n p \cdot \sum_{m-1=0}^{n-1} \binom{n-1}{m-1} p^{m-1} (1-p)^{(n-1)-(m-1)} = n(n-1)\, p^2 + n p$$ 

Introducing this value into (14), we have 

$$\Delta^2(f^n) = \frac{p(1-p)}{n}$$    (15) 

This represents the theoretical value of the dispersion for normal sequences. 
It may be seen from (15) that the dispersion of normal sequences diminishes 
with increasing n and becomes equal to 0 in the limit $n \to \infty$. 
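The two theoretical values can be checked by carrying out the sums directly with exact fractions. The sketch assumes independent trials with constant probability p, so that the weights are the binomial values $w_{nm}$, and compares the result with p and with the familiar binomial value p(1 − p)/n; the function names and the test values of n and p are chosen for illustration only.

```python
from fractions import Fraction
from math import comb

def w(n, m, p):
    """Binomial weight w_{nm}: probability of exactly m elements B among n."""
    return comb(n, m) * p**m * (1 - p)**(n - m)

def mean_and_dispersion_sq(n, p):
    """Theoretical average of m/n and mean square deviation of m/n from p."""
    mean = sum(w(n, m, p) * Fraction(m, n) for m in range(n + 1))
    disp_sq = sum(w(n, m, p) * (Fraction(m, n) - p) ** 2 for m in range(n + 1))
    return mean, disp_sq

# Assumed illustration values (not from the text).
p = Fraction(3, 10)
for n in (1, 10, 37):
    mean, disp_sq = mean_and_dispersion_sq(n, p)
    print(n, mean, disp_sq, p * (1 - p) / n)
```

Because the arithmetic is exact, the agreement is an identity for the chosen n and p rather than an approximation.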

These considerations can be extended to other than normal sequences. 
It may even happen that (11) holds also for certain nonnormal sequences, 
though this relation was derived only for normal sequences. For instance, 
(11) can be shown to be valid if the horizontal sequences possess probability 
transfer. There are sequences of other types for which (11) holds. All such 
sequences are characterized by the statement that their $f^{kn}$ lattice is 
homogeneous in respect to the average; therefore the relations (12)-(13) remain 
valid for them, too. However, (15) holds only for normal sequences, since 
equation (9) was used for the derivation. Therefore (15) represents a criterion 
of the normal character of sequences. 


TABLE 6 

       Normal dispersion,          Supernormal dispersion,     Subnormal dispersion, 
       sequence (1, § 27)          sequence (1, § 33)          sequence (2, § 33) 

 k     f^kn    δf^kn   δ²f^kn      f^kn    δf^kn   δ²f^kn      f^kn    δf^kn   δ²f^kn 

 1     0.6     -0.09   0.01        0.7     -0.14   0.02        0.5     +0.03   0.00 
 2     0.3     +0.21   0.04        0.2     +0.36   0.13        0.5     +0.03   0.00 
 3     0.4     +0.11   0.00        0.4     +0.16   0.03        0.7     -0.17   0.03 
 4     0.6     -0.09   0.01        0.9     -0.34   0.12        0.7     -0.17   0.03 
 5     0.5     +0.01   0.00        0.5     +0.06   0.00        0.4     +0.13   0.02 
 6     0.7     -0.19   0.04        0.4     +0.16   0.03        0.6     -0.07   0.00 
 7     0.7     -0.19   0.04        0.7     -0.14   0.02        0.4     +0.13   0.02 
 8     0.3     +0.21   0.04        0.7     -0.14   0.02        0.4     +0.13   0.02 

 Sum   4.1             0.18        4.5             0.37        4.2             0.12 

 M(f^n)   4.1/8 = 0.51             4.5/8 = 0.56                4.2/8 = 0.53 
 Δ(f^n)   √(0.18/8) = 0.15         √(0.37/8) = 0.22            √(0.12/8) = 0.12 


These results can be applied for practical purposes as follows. In the 
practice of statistical calculations finite lattices always occur. It suffices to 
compute the $f^{kn}$ for the column n. Let the number of the horizontal rows be s. 
We determine the value p as the mean value $M_s(f^{kn})^k$, that is, the summation 
occurring in (8a) is carried out only as far as s, and the mean value obtained 
is regarded as a practical limit. By the help of this value the deviations $\delta f^{kn}$ 
are determined according to (12a); then we calculate the dispersion by the 
summation according to (13a), likewise carried out only as far as s. Finally, 
the value thus obtained, the mean square of the deviations, is regarded as 
a practical limit, and we arrive at the statistical value of the dispersion. 

The procedure may be illustrated by an example. In the sequences (1, § 27), (1, § 33), and (2, § 33), which were studied previously, the first $s = 8$ sections of $n = 10$ elements were each counted with respect to $B$ (that is, using enumeration by sections), so that the frequency $f^{kn}$ was ascertained for each section. The results, together with the values $\delta f^{kn}$, are reproduced in table 6.

The $\delta f^{kn}$ column shows that the sequences converge differently. The greatest deviations occur in (1, § 33); the sequence (2, § 33) shows the best convergence. In the dispersion $\Delta(f^n)$ we have, correspondingly, a measure of convergence, which is greatest in (1, § 33) and smallest in (2, § 33); in (1, § 27) it has an intermediate value. Had we made the calculation for larger $n$, that is, for longer sections, the $\delta f^{kn}$ and therefore the $\Delta(f^n)$ would all have smaller values.
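The calculation just described can be sketched in a few lines of Python. The function and variable names are illustrative and do not come from the text; the input numbers are the section frequencies of table 6. Because the sketch keeps unrounded intermediate values, its results may differ from the two-decimal figures of the table in the last digit.

```python
import math

def practical_limit(freqs):
    """Mean of the section frequencies, regarded as the practical limit p."""
    return sum(freqs) / len(freqs)

def statistical_dispersion(freqs):
    """Square root of the mean squared deviation from the practical limit."""
    p = practical_limit(freqs)
    return math.sqrt(sum((p - f) ** 2 for f in freqs) / len(freqs))

# Section frequencies for s = 8 sections of n = 10 elements (table 6).
TABLE_6 = {
    "seq_1_27": [0.6, 0.3, 0.4, 0.6, 0.5, 0.7, 0.7, 0.3],
    "seq_1_33": [0.7, 0.2, 0.4, 0.9, 0.5, 0.4, 0.7, 0.7],
    "seq_2_33": [0.5, 0.5, 0.7, 0.7, 0.4, 0.6, 0.4, 0.4],
}

for name, freqs in TABLE_6.items():
    print(name, round(practical_limit(freqs), 3),
          round(statistical_dispersion(freqs), 3))
```

The ordering of the three dispersions is the same as in the table: largest for (1, § 33), smallest for (2, § 33).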

It is possible to calculate the normal theoretical value of the dispersion, that is, the value that the dispersion assumes if the sequences are normal. We then substitute in (15) for $p$ the value $M(f^n)$ that we have regarded as the practical limit. By comparing the normal value of $\Delta(f^n)$ thus calculated with the statistically found value, we are able to judge whether the assumption of a normal character of the sequence is correct. For this purpose the following definitions are introduced:

1. If the statistical dispersion is equal to the normal one, the sequences have a normal dispersion.

2. If the statistical dispersion is greater than the normal one, the sequences 
have a supernormal dispersion. 

3. If the statistical dispersion is smaller than the normal one, the sequences 
have a subnormal dispersion. 

From this viewpoint, consider the example again. Formula (15) furnishes for $n = 10$ and $p = \tfrac{1}{2}$

$$\Delta(f^n)_{\text{normal}} = 0.16 \tag{16}$$

Comparison with table 6 shows that the sequence (1, § 27) has a normal dispersion; (1, § 33) a supernormal dispersion; (2, § 33) a subnormal dispersion.
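The comparison can be written out as a small Python sketch. It assumes that formula (15) is the expression $\sqrt{p(1-p)/n}$, as derived in § 55. The tolerance `tol` is a hypothetical parameter introduced here, since with sections as short as these the statistical and the normal value never agree exactly, and "equal" must be read as "close".

```python
import math

def normal_dispersion(p, n):
    """Dispersion of a normal sequence, sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

def classify(statistical, p, n, tol=0.02):
    """Classify a statistical dispersion against the normal value.

    tol is an assumed tolerance, not part of the definitions in the text.
    """
    normal = normal_dispersion(p, n)
    if abs(statistical - normal) <= tol:
        return "normal"
    return "supernormal" if statistical > normal else "subnormal"

# Statistical dispersions taken from table 6, with n = 10 and p = 1/2.
for name, disp in [("(1, 27)", 0.15), ("(1, 33)", 0.22), ("(2, 33)", 0.12)]:
    print(name, classify(disp, p=0.5, n=10))
```

With a much smaller tolerance the first sequence would no longer count as normal, which illustrates how rough the classification is for such short sequences.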

The classification of dispersions is related as follows to the preceding classification, which divides all sequences into normal and nonnormal sequences. If the dispersion is normal we are not certain that the sequence is normal; but if the dispersion is not normal we can conclude that the sequence is not normal. The concept of normal dispersion is more comprehensive than that of normal sequence; it provides a less detailed characterization of the sequence. For practical applications the classification by the normal dispersion is usually preferred, since we can easily ascertain the dispersion, whereas it is rather cumbersome to determine whether the sequence is normal.





§ 53. The Dispersion of Nonnormal Sequences 

In § 49 the Bernoulli theorem was derived from the assumption that the sequence is normal. It is possible to show that this is not a necessary condition of a theorem of this kind; sequences of more general character, among them sequences with probability transfer, likewise possess a Bernoulli theorem. By a Bernoulli theorem we mean a theorem according to which the probability that the frequency $f^n$ lies within $p \pm \delta$ converges toward 1 with increasing $n$. Such sequences satisfy (28, § 49), whereas (27, § 49) holds with a different expression for $\gamma$. The type of the sequence has, for all such sequences, an influence only on the kind of convergence; the fact remains that there exists a convergence toward 1.

Since the kind of convergence is best characterized by the frequency dispersion, we shall deal with the problem of dispersion also for sequences with probability transfer. For such sequences (13b, § 52) is valid; but the quantity $P(A,F_m^{kn})^k$ is not measured by the value $w_{nm}$ of (9, § 52), and therefore (15, § 52) does not hold. Thus (9, § 52) must be replaced by a more complicated relation. These calculations were first carried out by Pólya.¹ In the appendix (§ 82) of the German edition of this book (omitted in this edition) I present a calculation developed by V. Bargmann; it provides for the value $\Delta(f^n)$ an approximation given by the formula

$$\Delta(f^n) = \sqrt{1 + \frac{2\varepsilon}{1 - p - \varepsilon}} \cdot \Delta(f^n)_{\text{normal}} \tag{1}$$

The formula is valid for enumeration with overlapping, and holds also for enumeration by sections, if the sequences possess a regular domain of invariance. For lattice enumeration the corresponding condition of lattice invariance must be added. $\varepsilon$ is the quantity that characterizes probability transfer, which was called the degree of transfer. If we restrict the consideration again to the nondegenerate case $1 - p - \varepsilon > 0$, which was formulated in (16, § 33), the sign of the second term within the scope of the square root of (1) depends only on the sign of $\varepsilon$. We thus arrive at the result

1. For $\varepsilon = 0$ the dispersion is normal.

2. For positive $\varepsilon$, that is, probability drag, the dispersion is supernormal.

3. For negative $\varepsilon$, that is, probability compensation, the dispersion is subnormal.

¹ György Pólya, “Über die Statistik verketteter Vorgänge,” in Zs. f. angew. Math. u. Mech., Vol. III (1923), p. 279; and Sur quelques points de la théorie des probabilités (Paris, 1930).

Theorem 1 is a matter of course; the significant result is formulated in theorems 2 and 3. They can be paraphrased as follows:

The generalization of the sequence type introduced by the concept of probability transfer includes a classification of sequences corresponding to the classification provided by the concept of dispersion. Sequences with probability drag converge less well than, and sequences with probability compensation converge better than, normal sequences.

This fact permits an inference from the dispersion to the sequence type if it is known from other reasons that the sequence has the character of probability transfer expressed in (9, § 33) (that for a selection by predecessors only the first predecessor is relevant), and if it is furthermore known that the sequence has a regular domain of invariance. On these conditions the following classification can be set up:

1. For normal dispersion the sequence is normal. 

2. For supernormal dispersion there exists probability drag. 

3. For subnormal dispersion there exists probability compensation. 

These reversals of the previous theorems follow because, if the sequences 
have probability transfer, they satisfy formula (1), from which we easily 
derive 

$\Delta(f^n) = \Delta(f^n)_{\text{normal}}$ implies $\varepsilon = 0$

$\Delta(f^n) > \Delta(f^n)_{\text{normal}}$ implies $\varepsilon > 0$

$\Delta(f^n) < \Delta(f^n)_{\text{normal}}$ implies $\varepsilon < 0$

Once the value of $\Delta(f^n)$ is found empirically by the use of the statistical definition (13a, § 52), the value of $\varepsilon$ can be derived from (1), if only we know that the conditions of probability transfer are satisfied.

Formula (1) may be applied to the example of the sequences (1, § 33) and 
(2, § 33), which were constructed as sequences with probability transfer; we 
explained that for (1, §33) wc have e = + and for (2, §33), € = — 
With these values, formula (1) supplies 

for (1, § 33): $\Delta(f^n) = 0.22$        for (2, § 33): $\Delta(f^n) = 0.11$        (2)

These values are in agreement with the statistically found values,² which were given in table 6 (p. 288). Conversely, $\varepsilon$ can be computed from the statistical values given in the table.
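Both directions of this calculation can be sketched numerically. The sketch assumes that formula (1) reads $\Delta(f^n) = \sqrt{1 + 2\varepsilon/(1-p-\varepsilon)} \cdot \Delta(f^n)_{\text{normal}}$. It further assumes $\varepsilon = +\tfrac{1}{6}$ and $\varepsilon = -\tfrac{1}{6}$ for the two example sequences; these values are not legible in the present text and are chosen here only because, with $p = \tfrac{1}{2}$ and $n = 10$, they reproduce the results stated in (2). The function names are illustrative.

```python
import math

def normal_dispersion(p, n):
    """Dispersion of a normal sequence, sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

def transfer_dispersion(p, n, eps):
    """Assumed form of formula (1); requires the nondegenerate case 1 - p - eps > 0."""
    if 1 - p - eps <= 0:
        raise ValueError("degenerate case: 1 - p - eps must be positive")
    return math.sqrt(1 + 2 * eps / (1 - p - eps)) * normal_dispersion(p, n)

def degree_of_transfer(dispersion, p, n):
    """Solve the assumed formula (1) for eps, given an observed dispersion."""
    r = (dispersion / normal_dispersion(p, n)) ** 2
    return (r - 1) * (1 - p) / (r + 1)

# Assumed degrees of transfer, with p = 1/2 and n = 10.
print(round(transfer_dispersion(0.5, 10, 1 / 6), 2))    # probability drag
print(round(transfer_dispersion(0.5, 10, -1 / 6), 2))   # probability compensation
# Converse direction: estimate eps from a statistically found dispersion.
print(degree_of_transfer(0.22, 0.5, 10))
```

The inversion follows from writing $r$ for the squared ratio of the two dispersions, so that $r - 1 = 2\varepsilon/(1-p-\varepsilon)$ and hence $\varepsilon = (r-1)(1-p)/(r+1)$.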

In more general sequences, which are not subject to the condition of probability transfer, the correspondence between dispersion and selection by the predecessor no longer exists. For instance, it may happen that, for subnormal dispersion, $P(A.B,B^1) > P(A,B)$, as will be shown by an example in § 56. In these cases the sequence type is too complicated to admit of a sufficient characterization by a single quantity, such as is supplied by the dispersion or the degree of transfer $\varepsilon$. Classification by dispersion represents for such sequences a relatively rough characterization by which only certain average properties of the sequences are expressed.

² I emphasize once more that the sequences chosen as examples are too short to be used for a reliable calculation of the dispersion. I merely wish to demonstrate by this example the method of calculation.

For probability sequences of the most general type we cannot even guarantee that a dispersion exists, that is, that the sequence of mean values in (13a, § 52) converges toward a limit with increasing $s$. That this convergence is contingent upon certain conditions is seen from (13b, § 52). The formula, it is true, contains only a summation over a finite number of terms, and therefore a dispersion will always exist if the probabilities $P(A,F_m^{kn})^k$ exist. But the existence of the probabilities depends on the existence of the individual phase probabilities (§ 27), or, in lattice enumeration, on the condition that the vertical sequences are of a probability character and possess combinatory probabilities. The type of probability sequence for which a dispersion exists is, therefore, very general, but it is not identical with the most general type of probability sequence.


§ 54. A Simple Interpretation of the Dispersion 


In § 43 the probability that the deviation remains within the limits given by the linear dispersion was calculated for a Gaussian distribution, or normal curve. It was found that this probability is independent of the measure of precision $h$ and very nearly equal to $\tfrac{2}{3}$. This fact permits a corresponding interpretation of the dispersion, since the Bernoulli distribution can be approximated by a Gauss exponential function, according to (27, § 49). In fact, if we substitute (15, § 52) for $\delta$ in (27, § 49), we find for $\gamma$ the value $\frac{1}{\sqrt{2}}$ which, according to (11, § 43), leads to $b_{n\delta} = \tfrac{2}{3}$. Considering, for the present, only normal sequences, we can formulate the following result: the dispersion determines the limits within which the frequency is to be expected with the probability $\tfrac{2}{3}$. (See fig. 16, p. 222.)

The result may be illustrated by representing numerically the value of $\Delta^*(f^n)$ in a table calculated in accordance with (15, § 52). (See table 7.) Like $\delta^*$ in table 4 (p. 273), $\Delta^*(f^n)$ stands for the dispersion measured in percentage of $p$; so we have $p \pm \Delta(f^n) = p(1 \pm \Delta^*(f^n))$.

The tabulation is only a slightly changed and abbreviated reproduction of table 4, which states the probability $b_{n\delta}$ for given limits $\delta^*$ as a function of $n$. Table 7 states the limits $\Delta^*(f^n)$ that exist for a fixed probability $b_{n\delta} = \tfrac{2}{3}$ as a function of $n$. We see that the specification of $\Delta^*(f^n)$ represents only a more convenient form of a statement that was made above by the help of the Bernoulli probability $b_{n\delta}$. Instead of saying, “If $p = \tfrac{1}{2}$, then for 100 cases the frequency lies with the probability 0.95 between 40 and 60”, we now say, “If $p = \tfrac{1}{2}$, then for 100 cases the frequency lies with the probability $\tfrac{2}{3}$ between 45 and 55”. For the latter sentence we use the abbreviation, “If $p = \tfrac{1}{2}$, then for 100 cases the dispersion $\Delta^*(f^n)$ amounts to 10%”. The concept of dispersion contains no new logical problems; its use is logically equivalent to the probability statements of the Bernoulli theorem.


TABLE 7

Dispersion for Normal Sequences

   p         1/4           1/2           3/4

   n       Δ*(f^n)       Δ*(f^n)       Δ*(f^n)

   10        55%           32%           18%
   30        32%           18%           11%
   50        25%           14%            8%
  100        17%           10%            6%
  500         8%            4%           2½%
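The entries of such a table can be recomputed with a short sketch. It assumes that $\Delta^*(f^n)$ is the normal dispersion $\sqrt{p(1-p)/n}$ divided by $p$, and that the three columns belong to $p = \tfrac{1}{4}, \tfrac{1}{2}, \tfrac{3}{4}$; the function name is illustrative. The sketch also evaluates the Gaussian probability of remaining within one dispersion, which is the quantity described as very nearly $\tfrac{2}{3}$. Recomputed percentages may differ from the printed ones in the last digit because of rounding.

```python
import math

def relative_dispersion(p, n):
    """Delta*(f^n): the dispersion sqrt(p(1-p)/n) expressed as a fraction of p."""
    return math.sqrt(p * (1 - p) / n) / p

# Probability that a Gaussian deviation stays within one dispersion.
within_one_dispersion = math.erf(1 / math.sqrt(2))
print(round(within_one_dispersion, 3))

for n in (10, 30, 50, 100, 500):
    row = [round(100 * relative_dispersion(p, n), 1) for p in (0.25, 0.5, 0.75)]
    print(n, row)
```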


The approximation of the $b_{n\delta}$ by a Gauss function is not necessarily linked to normal sequences; certain nonnormal sequences have the same property, though with a different measure of precision because of their different dispersion. Thus the relation $b_{n\delta} = \tfrac{2}{3}$ also holds for many sequences with supernormal or subnormal dispersion and, in particular, for sequences with probability transfer. Upon this fact rests the great practical value of the dispersion. The statistical calculation of the dispersion represents a procedure that, without a more exact knowledge of the sequence type, permits determination of the limits within which the deviation can be expected with the probability $\tfrac{2}{3}$.

§ 55. A Simple Derivation of Bernoulli’s Theorem 

A proof of Bernoulli’s theorem will now be presented. It recommends itself 
by an amazing mathematical simplicity, though it exhibits less clearly the 
logical problems of the theorem. The derivation is based on the properties of 
the dispersion and the Tchebychev inequality (18, §37). 

We start with the method developed in (3, § 35) by which the enumeration of classes is reduced to an addition of amounts, a method employing the amounts 1 and 0 supplied by the symbol $V$. We designate the cases $B$ and $\bar{B}$ by $B_1$ and $B_2$ and coordinate to them the amounts $u_1 = u(B_1)$ and $u_2 = u(B_2)$ by the definition

$$u_i = V(y_i \in B_1) \qquad u_k = u(B_k) = V(B_k = B_1) \tag{1a}$$

that is,

$$u_1 = 1 \qquad u_2 = 0 \tag{1b}$$

Furthermore, we define the amount of a combination of consecutive elements in the sense of (1, § 38) by the addition of the separate amounts. Thus we have

$$u(B_i . B_k^1) = u(B_i) + u(B_k) = u_i + u_k \tag{2}$$

A combination of $n$ elements that contains $m_i$ elements $B$ gives the value

$$u(B_{k_1}^1 \ldots B_{k_n}^n) = u_{k_1} + \ldots + u_{k_n} = m_i \tag{3}$$

According to (3b, § 38), we have

$$M(u_{k_1} + \ldots + u_{k_n})_{k_1 \ldots k_n} = M(u_{k_1})_{k_1} + \ldots + M(u_{k_n})_{k_n} = n \cdot M(u) \tag{4}$$

With the help of (1b) we obtain

$$M(u) = \sum_{k=1}^{2} P(A,B_k) \cdot u_k = p \tag{5}$$

and thus find, with (3),

$$M(u_{k_1} + \ldots + u_{k_n})_{k_1 \ldots k_n} = M(m_i)_i = M(m) = np \tag{6}$$

where $m$ designates the possible values $0, 1, 2, \ldots n$ for $m_i$; or, with (13, § 35) and (3, § 52),

$$M\left(\frac{m}{n}\right) = M(f^n) = p \tag{6'}$$

The equation may be explained by the following considerations. Because of 
the definition (1) for the amounts, taking the average for sections of the 
length of one element means nothing but counting the elements B; thus M(u), 
according to (5), is equal to the probability of B. Since the average is additive, 
we derive with (6) the result that for sections of the length n the average 
of the relative frequency must also be equal to this probability. 

The result was derived without any restricting assumptions concerning the 
structure of the sequence, since (4), as was shown in § 38, holds for any kind 
of mutual dependence of events, and the nature of the phase probabilities is 
irrelevant for (4). However, if we now calculate the value of the dispersion, 
we must introduce a specializing assumption concerning the structure of the 
sequence. Assume that the sequence is normal. Since then, according to § 38, 
the dispersion is additive, we have 

$$\Delta^2(u_{k_1} + \ldots + u_{k_n})_{k_1 \ldots k_n} = \Delta^2(u_{k_1})_{k_1} + \ldots + \Delta^2(u_{k_n})_{k_n} = n \cdot \Delta^2(u) \tag{7}$$

Now we obtain from (1) and (5)

$$\delta u_1 = u_1 - M(u) = 1 - p \qquad \delta u_2 = u_2 - M(u) = -p \tag{8}$$

and thus

$$\Delta^2(u) = \sum_{i=1}^{2} P(A,B_i) \cdot \delta^2 u_i = p(1-p)^2 + (1-p)p^2 = p(1-p) \tag{9}$$

Therefore we have, with (7) and (3),

$$\Delta^2(u_{k_1} + \ldots + u_{k_n})_{k_1 \ldots k_n} = \Delta^2(m) = np(1-p) \tag{10}$$

With (9a, § 37) and (3, § 52) we obtain

$$\Delta^2\left(\frac{m}{n}\right) = \frac{1}{n^2}\,\Delta^2(m) = \Delta^2(f^n) = \frac{p(1-p)}{n} \tag{11}$$

This is identical with the value of the dispersion for normal sequences derived in (15, § 52). The occurrence of $n$ in the denominator, and thus the dying down of the dispersion toward 0 with increasing $n$, appears here as the effect of the law of the compensation of the dispersion expressed in (7), according to which the quadratic dispersion increases only with $n$ and not with $n^2$ [see (8b, § 38)].

Since for normal sequences in the narrower sense the probabilities in lattice 
enumeration are equal to those holding for enumeration with overlapping or 
for enumeration by sections, the results can be interpreted also in lattice 
enumeration, but that interpretation need not be elaborated. 
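The two results of this derivation, the mean $p$ and the quadratic dispersion $p(1-p)/n$ of the relative frequency, can be checked by direct enumeration. The following sketch assumes independent trials with constant probability $p$, so that the probability of $m$ hits in $n$ trials is given by the binomial formula; the function name is illustrative. Exact fractions are used so that the comparison is not blurred by rounding.

```python
from fractions import Fraction
from math import comb

def frequency_moments(p, n):
    """Exact mean and quadratic dispersion of f = m/n for n independent trials."""
    # Probability of exactly m hits in n trials (binomial formula).
    w = [comb(n, m) * p**m * (1 - p)**(n - m) for m in range(n + 1)]
    mean = sum(w[m] * Fraction(m, n) for m in range(n + 1))
    var = sum(w[m] * (Fraction(m, n) - mean) ** 2 for m in range(n + 1))
    return mean, var

p, n = Fraction(1, 3), 12
mean, var = frequency_moments(p, n)
print(mean == p, var == p * (1 - p) / n)
```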

Further conclusions can be drawn by the use of Tchebychev’s inequality (18, § 37). When we substitute for $\Delta^2(u)$ the value $\Delta^2(f^n)$, we obtain

$$w_\delta \geq 1 - \frac{p(1-p)}{n\delta^2} \tag{12}$$

On account of the definitions (1a) and (1b), $w_\delta$ represents the probability that the relative frequency for $n$ elements lies within $p \pm \delta$, and therefore $w_\delta$ is identical with the Bernoulli probability $b_{n\delta}$. Since $n$ occurs in the denominator of (12), $w_\delta$ goes toward 1 for increasing $n$, if $\delta$ is kept unchanged; thus we have

$$\lim_{n \to \infty} b_{n\delta} = 1 \tag{13}$$

in correspondence to (28, § 49).
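How the bound behaves for growing $n$ can be seen in a short sketch that compares it with the exact probability. The sketch assumes that (12) is the usual Tchebychev bound $1 - p(1-p)/(n\delta^2)$ and that the sequence is normal, so that the exact probability follows from the binomial formula; the function names and the chosen values of $p$ and $\delta$ are illustrative.

```python
from fractions import Fraction
from math import comb

def chebyshev_bound(p, n, delta):
    """Lower bound for the probability that f lies within p +/- delta."""
    return 1 - p * (1 - p) / (n * delta**2)

def exact_probability(p, n, delta):
    """Exact probability of |m/n - p| <= delta for n independent trials."""
    return sum(
        comb(n, m) * p**m * (1 - p)**(n - m)
        for m in range(n + 1)
        if abs(Fraction(m, n) - p) <= delta
    )

p, delta = Fraction(1, 2), Fraction(1, 10)
for n in (10, 100, 1000):
    print(n, float(chebyshev_bound(p, n, delta)),
          float(exact_probability(p, n, delta)))
```

For small $n$ the bound is negative and therefore says nothing; it becomes informative only when $n\delta^2$ exceeds $p(1-p)$. The bound is in general far from sharp, which is consistent with the remark that the Gauss distribution cannot be derived in this manner.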

This relation is virtually identical with the first of the relations (25, § 49), differing from it only in that the interval $\pm\delta$ is assumed symmetrical with respect to $p$. If $p$ is not in the center of the interval, we can cut off an outer section of the larger part of the interval so as to place $p$ in the center; since the convergence to 1 holds for this smaller interval, it holds also for the original interval. This proves the first of the relations (25, § 49). The second relation (25, § 49) follows from the first in consideration of (20, § 49).

The most important conclusion of Bernoulli’s theorem is thus proved. But 
it is not possible to derive the approximate validity of the Gauss distribution 
in this manner. 


§ 56. Poisson Sequences 

A special type of nonnormal sequences was investigated by Poisson. He considers the sequence that results when an event is played for with the successive probabilities $p_1 \ldots p_\lambda$, so that after a period of length $\lambda$ the probabilities repeat themselves in the same order. In order to find the structure of such sequences we must first translate the given definition into our symbolism.

Poisson’s arrangement may be regarded as a regular division of the length $\lambda$, for which we have

$$P(A.S_{\lambda\kappa},B) = p_\kappa \qquad \kappa = 1, 2, \ldots \lambda \tag{1}$$

Thus $p_\kappa$ is the probability within the $\kappa$-th subsequence. Besides, from the nature of the division we have

$$P(A,S_{\lambda\kappa}) = \frac{1}{\lambda} \qquad \kappa = 1, 2, \ldots \lambda \tag{2a}$$

$$P(A.S_{\lambda\kappa},S^{\rho}_{\lambda,\kappa+\rho}) = 1 \qquad P(A.S_{\lambda\kappa},S^{\rho}_{\lambda\nu}) = 0 \quad \text{for } \nu \neq \kappa + \rho \tag{2b}$$

(2a) follows from the frequency interpretation, (2b) from axiom II,1 because of $(S_{\lambda\kappa} \supset S^{\rho}_{\lambda,\kappa+\rho})$. The subscript is again counted cyclically: we put $\kappa - 1 = \lambda$ for $\kappa = 1$, $\kappa + 1 = 1$ for $\kappa = \lambda$. We inquire after the probability $P(A,B) = p$, that is, after the probability in the major sequence. According to the theorem of elimination (21, § 19), we have

$$P(A,B) = \sum_{\kappa=1}^{\lambda} P(A,S_{\lambda\kappa}) \cdot P(A.S_{\lambda\kappa},B) \tag{3}$$

With (2a) and (1) we thus find

$$P(A,B) = p = \frac{1}{\lambda} \sum_{\kappa=1}^{\lambda} p_\kappa \tag{4}$$

A «-l 

This probability is the mean value of the individual probabilities, a result which seems plausible. For the derivation, however, the frequency interpretation was used in (2a). If we do not use this interpretation, but remain in the formal calculus of probability, we cannot prove that the mean value $p$ has the meaning of a probability; but even then it can be shown, in analogy to Bernoulli's theorem, that the mean value takes over the role of the critical point of the Bernoulli curves. Because this proof was the chief aim of Poisson, his theorem is often regarded as an extension of the Bernoulli theorem. Like normal sequences, the Poisson sequences satisfy the relations (25, § 49), but their convergence is quantitatively different. The result will be proved in (17). It is preferable, however, to define the Poisson sequences in the formal calculus by the conditions (2a) and (2b) in combination with (1); then all the following calculations can be carried out in the formal calculus of probability. But then we lose the possibility of interpreting equation (2) as a regular division.

The inner order of the sequence will be investigated first. Its structure is not yet defined by (1) and (2); we must add Poisson's assumption that the individual subsequences are mutually independent. The assumption is written

$$P(A.S^{\nu}_{\lambda\kappa}.B^1_{i_1} \ldots B^{\nu-1}_{i_{\nu-1}},B^{\nu}_{i_\nu}) = P(A.S^{\nu}_{\lambda\kappa},B^{\nu}_{i_\nu}) \tag{5}$$

$$i_1 \ldots i_\nu = 1, 2 \qquad B_1 = B \qquad B_2 = \bar{B}$$


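Because of (2b) we can conclude with the help of (4, § 25) that $S^{\nu+\rho}_{\lambda,\kappa+\rho}$ may be added in the first term; by applying (4, § 25) again, we show that $S^{\nu}_{\lambda\kappa}$ may be dropped in the first term, thus obtaining

$$P(A.S^{\nu+\rho}_{\lambda,\kappa+\rho}.B^1_{i_1} \ldots B^{\nu-1}_{i_{\nu-1}},B^{\nu}_{i_\nu}) = P(A.S^{\nu+\rho}_{\lambda,\kappa+\rho},B^{\nu}_{i_\nu}) \tag{6}$$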


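We can now determine the phase probabilities. According to the theorem of elimination (21, § 19), we have, for instance,

$$P(A.B,B^1) = \sum_{\kappa=1}^{\lambda} P(A.B,S^1_{\lambda\kappa}) \cdot P(A.B.S^1_{\lambda\kappa},B^1) \tag{7}$$

With (5) we obtain

$$P(A.B.S^1_{\lambda\kappa},B^1) = P(A.S^1_{\lambda\kappa},B^1) = P(A.S_{\lambda\kappa},B) = p_\kappa \tag{8}$$

Now we have, because of (2b) and (6, § 25), since we may assume $P(A,B) > 0$,

$$P(A.B.S^1_{\lambda\kappa},S_{\lambda,\kappa-1}) = 1 \qquad P(A.B.S_{\lambda,\kappa-1},S^1_{\lambda\kappa}) = 1$$

Using the theorem of multiplication, we derive

$$P(A.B,S^1_{\lambda\kappa}) = \frac{P(A.B,S_{\lambda,\kappa-1}) \cdot P(A.B.S_{\lambda,\kappa-1},S^1_{\lambda\kappa})}{P(A.B.S^1_{\lambda\kappa},S_{\lambda,\kappa-1})} = P(A.B,S_{\lambda,\kappa-1}) \tag{9}$$

From Bayes’s rule (10, § 21) we obtain

$$P(A.B,S_{\lambda,\kappa-1}) = \frac{P(A,S_{\lambda,\kappa-1}) \cdot P(A.S_{\lambda,\kappa-1},B)}{\sum_{\rho=1}^{\lambda} P(A,S_{\lambda\rho}) \cdot P(A.S_{\lambda\rho},B)} = \frac{p_{\kappa-1}}{\sum_{\rho=1}^{\lambda} p_\rho} \tag{10}$$

The relation (7) thus assumes the form

$$P(A.B,B^1) = \frac{\sum_{\kappa=1}^{\lambda} p_\kappa\, p_{\kappa-1}}{\sum_{\rho=1}^{\lambda} p_\rho} \tag{11}$$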

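This probability will, in general, differ from $P(A,B)$. With the help of (11) and (4), we obtain for the ratio of these probabilities the value

$$\frac{P(A.B,B^1)}{P(A,B)} = \frac{\lambda \cdot \sum_{\kappa=1}^{\lambda} p_\kappa\, p_{\kappa-1}}{\sum_{\kappa=1}^{\lambda} p_\kappa \cdot \sum_{\rho=1}^{\lambda} p_\rho} \tag{12}$$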


This ratio is, in general, different from 1; the Poisson sequences are thus not free from aftereffect. But the ratio (12) may be smaller or greater than 1; whether the first predecessor induces a tendency to stay or a tendency to change depends on the values and the arrangement of the $p_\kappa$. If $\lambda = 2$, (12) is always $< 1$ (so long as the two $p_\kappa$ do not have the same value); here there
values. For instance, if X = 8 and pi = p 2 = p 3 = Pa = f, ps = pe = pi = p% 
— J, we have P(A,B) — §, P(A B,B l ) — A, that is, the ratio (12) is greater 
than 1. This result can be explained as follows: the attribute $B$ will result prevalently for elements played with $p_1 \ldots p_4$; the attribute $\bar{B}$, for elements played with $p_5 \ldots p_8$. The $B$-elements will thus prevalently possess $B$-elements as successors.
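The dependence of the ratio on the arrangement of the individual probabilities can be explored with a short sketch. It assumes that the probability of $B$ after a predecessor $B$ is the sum of the products $p_\kappa p_{\kappa-1}$ divided by the sum of the $p_\kappa$, with the subscript counted cyclically. The numerical values below are hypothetical illustrations (four high probabilities followed by four low ones, and a period of two); they are not taken from the text. Function names are illustrative.

```python
from fractions import Fraction

def major_probability(ps):
    """P(A,B): the mean of the individual probabilities."""
    return sum(ps) / len(ps)

def successor_probability(ps):
    """P(A.B,B^1); the index k - 1 wraps around cyclically for k = 0."""
    return sum(ps[k] * ps[k - 1] for k in range(len(ps))) / sum(ps)

def aftereffect_ratio(ps):
    """Above 1: tendency to stay. Below 1: tendency to change."""
    return successor_probability(ps) / major_probability(ps)

# Hypothetical values: period 8, four high probabilities then four low ones.
ps8 = [Fraction(3, 4)] * 4 + [Fraction(1, 4)] * 4
# Hypothetical values: period 2.
ps2 = [Fraction(1, 3), Fraction(2, 3)]

print(aftereffect_ratio(ps8))  # 9/8
print(aftereffect_ratio(ps2))  # 8/9
```

For a period of two the ratio reduces to $4p_1p_2/(p_1+p_2)^2$, which cannot exceed 1; this is the tendency to change mentioned above.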

The dispersion is calculated as in § 55; we use for the determination of the frequency the definition of amounts given in (1a and 1b, § 55). In analogy to (5, § 55), for the first element of the sequence, $M(u)$ is $= p_1$; for the second element it is $= p_2$; and so on. Thus we obtain by addition, as in (4, § 55),

$$M(u[B^1 \ldots B^n]) = \frac{n}{\lambda} \sum_{\kappa=1}^{\lambda} p_\kappa \tag{13}$$

This holds strictly only for a value of $n$ that is a multiple of $\lambda$, but it is valid in good approximation for any large $n$. Furthermore, the dispersion for the first element is $= p_1 \cdot (1 - p_1)$; for the second, it is $= p_2 \cdot (1 - p_2)$; and so on. Thus we obtain by addition, in analogy to (7, § 55),

$$\Delta^2(u[B^1 \ldots B^n]) = \frac{n}{\lambda} \cdot \sum_{\kappa=1}^{\lambda} p_\kappa (1 - p_\kappa) \tag{14}$$

The transition to the relative frequencies, analogous to (11, § 55), gives

$$M(f^n) = \frac{1}{\lambda} \sum_{\kappa=1}^{\lambda} p_\kappa = p \qquad \Delta^2(f^n) = \frac{1}{n\lambda} \sum_{\kappa=1}^{\lambda} p_\kappa (1 - p_\kappa) \tag{15}$$

In order to prove, in analogy to (12, § 55), that the frequency converges toward the mean value $p$, we use the Tchebychev inequality (18, § 37) and construct the expression

$$w_\delta \geq 1 - \frac{1}{n\lambda\delta^2} \cdot \sum_{\kappa=1}^{\lambda} p_\kappa (1 - p_\kappa) \tag{16}$$

With $w_\delta = b_{n\delta}$ (though these quantities are different from the $b_{n\delta}$ of normal sequences) we conclude

$$\lim_{n \to \infty} b_{n\delta} = 1 \tag{17}$$

This relation is equivalent to the relations (25, § 49), as was shown at the end of § 55.

Furthermore, an important result concerning the dispersion follows from (15). We can show that $\Delta^2(f^n)$, according to (15), is always smaller than (in the limiting case, equal to) $\Delta^2(f^n)$, as given by (15, § 52), if the latter dispersion is calculated by using the mean value $p$, according to (15): the Poisson sequences always have subnormal dispersion in comparison with a normal sequence of the same frequency. This conclusion can be drawn by the help of the Schwarz inequality,¹

$$\left(\sum_{i=1}^{n} a_i b_i\right)^2 \leq \left(\sum_{i=1}^{n} a_i^2\right) \cdot \left(\sum_{i=1}^{n} b_i^2\right) \tag{18}$$

which for $b_i = 1$ assumes the more special form

$$\left(\sum_{i=1}^{n} a_i\right)^2 \leq n \cdot \sum_{i=1}^{n} a_i^2 \tag{19}$$

¹ See, for instance, R. Courant and D. Hilbert, Methoden der mathematischen Physik (Berlin, 1924), Vol. I, p. 2.

For our purpose we compare the two expressions

$$n \cdot \Delta^2(f^n)_{\text{Poisson}} = \frac{1}{\lambda} \sum_{\kappa=1}^{\lambda} p_\kappa (1 - p_\kappa) = \frac{1}{\lambda} \sum_{\kappa=1}^{\lambda} p_\kappa - \frac{1}{\lambda} \sum_{\kappa=1}^{\lambda} p_\kappa^2 \tag{20a}$$

$$n \cdot \Delta^2(f^n)_{\text{normal}} = p(1 - p) = \frac{1}{\lambda} \sum_{\kappa=1}^{\lambda} p_\kappa - \left(\frac{1}{\lambda} \sum_{\kappa=1}^{\lambda} p_\kappa\right)^2 \tag{20b}$$

The first term in each is the same, but the second term, which is to be subtracted, has a larger absolute amount in the first expression because of (19) (it has an equal amount only in the limiting case), and thus we have

$$\Delta(f^n)_{\text{Poisson}} \leq \Delta(f^n)_{\text{normal}} \tag{21}$$

Since in (18) the equality sign can hold only if the $a_i$ are proportional to the $b_i$, the equality sign in (21) can result only if all the $p_\kappa$ have the same value, that is, if the Poisson sequence goes over into a normal sequence.
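The inequality can be illustrated numerically. The sketch assumes that the quadratic dispersion of the Poisson sequence is the sum of the $p_\kappa(1-p_\kappa)$ divided by $n\lambda$, and that the normal value is $p(1-p)/n$ with $p$ the mean of the $p_\kappa$. The probabilities used are hypothetical, and the function names are illustrative.

```python
from fractions import Fraction

def poisson_quadratic_dispersion(ps, n):
    """Quadratic dispersion of f^n for a Poisson sequence (n a multiple of len(ps))."""
    return sum(p * (1 - p) for p in ps) / (n * len(ps))

def normal_quadratic_dispersion(ps, n):
    """Quadratic dispersion of a normal sequence with the same mean probability."""
    p = sum(ps) / len(ps)
    return p * (1 - p) / n

# Hypothetical values: period 8, four high probabilities then four low ones.
ps = [Fraction(3, 4)] * 4 + [Fraction(1, 4)] * 4
print(poisson_quadratic_dispersion(ps, 8))  # 3/128
print(normal_quadratic_dispersion(ps, 8))   # 1/32
```

When all entries of `ps` are equal, the two functions return the same value, which corresponds to the limiting case in which the Poisson sequence goes over into a normal one.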

Because of this property of the dispersion the Poisson sequences represent a case in which classification by the dispersion does not correspond to classification by phase probabilities. The discrepancy originates from the fact that in the Poisson sequences the phase probabilities are not determined by the first predecessor alone. In general,

$$P(A.B.B^1,B^2) \neq P(A.B^1,B^2) \tag{22}$$

Thus we are dealing with a type more general than that of probability transfer. Furthermore, the Poisson sequences are, of course, not regular-invariant, a fact that can be seen from their definition. In another respect, however, the Poisson sequences are more special than sequences with probability transfer, since they can represent only the type of subnormal dispersion and never that of supernormal dispersion.

The latter peculiarity may be illustrated by comparing the Poisson sequences with a type that, in contradistinction to (2), is defined by (1) and the relations

$$P(A,S_{\lambda\kappa}) = \frac{1}{\lambda} \qquad \kappa = 1, 2, \ldots \lambda \tag{23}$$

$$P(A.S^1_{\lambda\kappa_1}.S^2_{\lambda\kappa_2} \ldots S^{\nu-1}_{\lambda\kappa_{\nu-1}},S^{\nu}_{\lambda\kappa_\nu}) = P(A,S^{\nu}_{\lambda\kappa_\nu}) = P(A,S_{\lambda\kappa_\nu}) \tag{24}$$

This definition means that, as before, the main sequence divides into subsequences. The division is not regular, however, but of a random type in the sense of (24). The arrangement can be illustrated as follows. First we draw the ball $\kappa$ from an auxiliary bowl containing $\lambda$ balls; then we draw a ball from the bowl $S_{\lambda\kappa}$, in which there are black, $B$, and white, $\bar{B}$, balls in the ratio $p_\kappa$. The procedure is repeated for every element of the sequence. Besides (24), we assume independence, according to (5), in the form

$$P(A.S^1_{\lambda\kappa_1} \ldots S^{\nu}_{\lambda\kappa_\nu}.B^1_{i_1} \ldots B^{\nu-1}_{i_{\nu-1}},B^{\nu}_{i_\nu}) = P(A.S^{\nu}_{\lambda\kappa_\nu},B^{\nu}_{i_\nu}) \tag{25}$$

a condition that is also satisfied by the illustration.

In analogy to (3) and (4) we have

$$P(A,B) = \sum_{\kappa=1}^{\lambda} P(A,S_{\lambda\kappa}) \cdot P(A.S_{\lambda\kappa},B) = \frac{1}{\lambda} \sum_{\kappa=1}^{\lambda} p_\kappa \tag{26}$$


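that is, the mean value $p$ represents also in this case the probability of the major sequence. But the phase probabilities result in a different form. We have, for instance,

$$P(A.B,B^1) = \sum_{\kappa=1}^{\lambda} P(A.B,S^1_{\lambda\kappa}) \cdot P(A.B.S^1_{\lambda\kappa},B^1) \tag{27a}$$

For the last probability, formula (8) is valid because of (6, § 22); the preceding probability is equal to $P(A,S_{\lambda\kappa}) = \frac{1}{\lambda}$ according to (24), and we obtain

$$P(A.B,B^1) = \frac{1}{\lambda} \sum_{\kappa=1}^{\lambda} p_\kappa = p \tag{27b}$$

A corresponding conclusion can be derived for any length of groups of predecessors. With (24) and (25) we have

$$P(A.B^1_{i_1} \ldots B^{\nu-1}_{i_{\nu-1}},B^{\nu}) = \sum_{\kappa=1}^{\lambda} P(A.B^1_{i_1} \ldots B^{\nu-1}_{i_{\nu-1}},S^{\nu}_{\lambda\kappa}) \cdot P(A.B^1_{i_1} \ldots B^{\nu-1}_{i_{\nu-1}}.S^{\nu}_{\lambda\kappa},B^{\nu}) = \sum_{\kappa=1}^{\lambda} P(A,S^{\nu}_{\lambda\kappa}) \cdot P(A.S^{\nu}_{\lambda\kappa},B^{\nu}) \tag{28}$$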


We thus obtain a sequence that is free from aftereffect. It is even a normal sequence if we add to (24) and (25) the assumption that the regular divisions belong to the domain of invariance of these probabilities, that is, (24) and (25) hold likewise in subsequences that are selected by a regular division.

The difference between the Poisson sequences and the sequences defined by (23)-(25) may be illustrated as follows: for the Poisson sequences the individual elements are played with the probabilities $p_1 \ldots p_\lambda$ in a regular order; for the sequences defined by (23)-(25) the individual elements are also played with the probabilities $p_1 \ldots p_\lambda$, but since the selection of the probability for each element is left to chance, there arises a new source of dispersion that makes the dispersion greater than that of the Poisson sequences. Now the dispersion of the sequences (23)-(25) is the normal one. This is apparent, for instance, in (27a); since we have there $P(A.B,S^1_{\lambda\kappa}) = P(A,S_{\lambda\kappa}) = \frac{1}{\lambda}$, the mean in (27b) is taken in the same manner as for the major probability $P(A,B)$ according to (26). In the expression (7), however, the term $P(A.B,S^1_{\lambda\kappa})$ is determined by $p_{\kappa-1}$ according to (9) and (10) because of the regular succession of the $S_{\lambda\kappa}$; it is thus not equal to $P(A,S_{\lambda\kappa})$, and so the mean in (11) is formed differently.

For the phase probabilities of the Poisson sequences this may involve a tendency to change or a tendency to stay, depending on the length of the phase with respect to the total period of the $p_\kappa$. On the whole, however, these conditions produce a tendency to change; long groups of the same elements
will be rare. This result is obvious for X = 2. For instance, if p x = £, p 2 = f, 
we have p = but since we play alternatively with £ and f, B is obtained 
prevalently at one time, B prevalently the next time, with the result that the 
change from B to B is stronger than for the play with constant probability 
For larger $\lambda$, too, a tendency to change must eventually arise. For the example with $\lambda = 8$, groups of 4 successive elements $B$ will still be more frequent than in normal sequences; yet groups of 8 elements $B$ will occur less frequently, since in 8 elements the change of the $p_1 \ldots p_8$ will become noticeable. The regular cycle of the Poisson sequences thus acts in the sense of “a shuffling increased above the normal”. It is therefore the probability aftereffect in respect to the $S_{\lambda\kappa}$, expressed in (2b), that carries with it the subnormal character of the dispersion of the Poisson sequences.
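The statement about groups of 4 and of 8 successive elements can be made concrete in a short sketch. It assumes mutually independent subsequences, so that the probability of a run of elements $B$ starting in a given phase is the product of the corresponding $p_\kappa$, and it averages over the equally probable starting phases. The probabilities used are hypothetical (four high values followed by four low ones), and the function name is illustrative.

```python
from fractions import Fraction
from math import prod

def run_probability(ps, length):
    """Probability that `length` successive elements are all B,
    averaged over the equally probable starting phases of the cycle."""
    lam = len(ps)
    return sum(
        prod(ps[(start + j) % lam] for j in range(length))
        for start in range(lam)
    ) / lam

# Hypothetical values: period 8, four high probabilities then four low ones.
ps = [Fraction(3, 4)] * 4 + [Fraction(1, 4)] * 4
p = sum(ps) / len(ps)   # probability of B in the major sequence

for length in (4, 8):
    # Second column: Poisson sequence. Third column: normal sequence with the same p.
    print(length, run_probability(ps, length), p**length)
```

Under these assumed values a group of 4 has probability 5/64 against 1/16 for the normal sequence, whereas a group of 8 has probability 81/65536 against 1/256.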

The conditions of the Poisson sequences are clearly exhibited in lattice enumeration, since the vertical sequences express immediately the $p_\kappa$. The resulting lattice is nonconvergent and not lattice-invariant, since because of the independence of the subsequences we have

$$P(A.B^{ki},B^{k,i+\rho})^k = P(A,B^{k,i+\rho})^k \tag{29}$$

in contradistinction to the corresponding phase probability $P(A.B,B^{\rho})$. In the lattice representation the definition of the Poisson sequence can be extended even to $\lambda \to \infty$ if we add the condition that the mean value $p$

$$\lim_{\lambda \to \infty} \frac{1}{\lambda} \sum_{\kappa=1}^{\lambda} p_\kappa = p \tag{30}$$

exists. The amount $n \cdot \Delta^2(f^n)$ then remains finite, as can be seen from (20a); therefore the dispersion approaches 0 with increasing $n$, as in all cases previously considered. Here, too, the dispersion is always subnormal.

The importance of the Poisson sequences for the theory of probability has been overestimated. They have been regarded as supplying a generalization of the Bernoulli theorem; but this generalization has the disadvantage of being one-sided, since these sequences always lead to subnormal dispersion. Furthermore, as shown by (22), the sequences represent a somewhat involved generalization of normal sequences. For these two reasons, sequences with probability transfer are superior in logical significance.

However, the Poisson sequences possess a certain practical importance,
particularly in the form extended to infinitely many $p_\kappa$ according to (30).
Consider the transactions of a businessman: his chances of making money
will differ from case to case, and are given by probabilities $p_\kappa$, which, however,
lie between two not-too-distant limits $p^{(1)}$ and $p^{(2)}$ and satisfy (30).
Therefore he can expect an average gain $p$, even with a lower risk than
would obtain for constant probability, since the sequence has subnormal
dispersion. We shall return in § 72 to these results, which are relevant for the
practical value of the frequency interpretation.
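
The lower risk claimed for the businessman can be checked by a short calculation. The following sketch is an illustration added here, not a derivation from the text: it assumes independent trials, and the probabilities in `cycle` as well as all function names are hypothetical. It compares the variance of the relative frequency for varying probabilities $p_\kappa$ with the variance obtained for the constant mean probability $p$.

```python
from fractions import Fraction

def mean_probability(ps):
    """Arithmetic mean p of the probabilities p_kappa."""
    return sum(ps) / len(ps)

def frequency_variance_varying(ps):
    """Variance of the relative frequency for independent trials
    whose probabilities are the entries of ps (one trial each)."""
    n = len(ps)
    return sum(p * (1 - p) for p in ps) / n**2

def frequency_variance_constant(p, n):
    """Variance of the relative frequency for n independent trials
    with the same probability p."""
    return p * (1 - p) / n

# Hypothetical chances of gain lying between two not-too-distant limits.
cycle = [Fraction(4, 10), Fraction(5, 10), Fraction(6, 10), Fraction(5, 10)]

p = mean_probability(cycle)
varying = frequency_variance_varying(cycle)
constant = frequency_variance_constant(p, len(cycle))
print(p, varying, constant)
```

Under these assumptions the variance for the varying probabilities is the smaller one; the two coincide only when all $p_\kappa$ are equal.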


§ 57. Bernoulli Sequences 

It has been shown in repeated instances that Bernoulli’s theorem is satisfied 
also by sequences other than normal, if the theorem is interpreted as meaning 
the convergence relations (25, §49). Conversely, the Bernoulli theorem can 
be used for the definition of a sequence type of a very general nature. 

The definition must be prefaced by some mathematical remarks about the
Bernoulli functions $b_n$ and $w_n(f)$, connected by the relation (22, § 49). The
latter function was defined in (21, § 49) in terms of the function $w_{nm}$, which,
in turn, was defined by Newton’s formula (7, § 49). This formula is restricted
to normal sequences and must be abandoned for the general case, whereas
the other two formulas can be taken over. We thus assume that for the
sequences to be defined there exist probabilities $w_{nm}$, which, however, may be
different for the three modes of enumeration and are given, respectively, by

$$w_{nm} = P(A, F^n_m) \quad \text{for enumeration with overlapping} \tag{1}$$

$$w_{nm} = P(A.S_{n1}, F^n_m) \quad \text{for enumeration by sections} \tag{2}$$

$$w_{nm} = P(A, F^{kn}_m)^k \quad \text{for lattice enumeration} \tag{3}$$

The symbols $F^n_m$ and $F^{kn}_m$ have the meanings defined in (6, § 49) and (10a, § 50).

For normal sequences the $w_{nm}$ are determined by the probability $p$ of the
sequence. For sequences of more general types the $w_{nm}$ depend on further
parameters, such as phase probabilities, probabilities in regular divisions,
and lattice probabilities. These parameters may be named $s_1 \ldots s_r$, and the
Bernoulli density $w_n(f)$ defined in (21, § 49) can therefore be written:

$$w_n(p, s_1 \ldots s_r; f) = (n + 1) \cdot w_{nm} \qquad f = \frac{m}{n} \tag{4}$$

The semicolon indicates that $w_n$ is a relative probability function. The
Bernoulli probabilities $b_n$ depend on the same arguments and may be written

$$b_n(p, s_1 \ldots s_r; f_1, f_2) = \int_{f_1}^{f_2} w_n(p, s_1 \ldots s_r; f)\,df \tag{5}$$

They satisfy the relation

$$b_n(p, s_1 \ldots s_r; 0, 1) = \int_0^1 w_n(p, s_1 \ldots s_r; f)\,df = 1 \tag{6}$$


The functions $b_n$ are said to have Bernoulli properties if convergence relations
analogous to (25, § 49) hold:

$$\lim_{n \to \infty} b_n(p, s_1 \ldots s_r; f_1, f_2) = 1 \quad \text{for } f_1 \leq p \leq f_2$$

$$\lim_{n \to \infty} b_n(p, s_1 \ldots s_r; f_1, f_2) = 0 \quad \text{for } p < f_1 \text{ or } p > f_2 \tag{7}$$

By means of integrations over the parameters $s_1 \ldots s_r$, the following
function is constructed:

$$w^*_n(p; f) = \int_{\alpha_1}^{\beta_1} \cdots \int_{\alpha_r}^{\beta_r} w_n(p, s_1 \ldots s_r; f)\,ds_r \ldots ds_1 \tag{8a}$$

The values $\alpha_i$ and $\beta_i$ are the end points of the range of $s_i$ and may be functions
of $p$. Correspondingly, a function $b^*_n$ is defined:

$$b^*_n(p; f_1, f_2) = \int_{f_1}^{f_2} w^*_n(p; f)\,df \tag{8b}$$

This function can be shown to have Bernoulli properties in the form

$$\lim_{n \to \infty} b^*_n(p; f_1, f_2) = b(p) \quad \text{for } f_1 \leq p \leq f_2$$

$$\lim_{n \to \infty} b^*_n(p; f_1, f_2) = 0 \quad \text{for } p < f_1 \text{ or } p > f_2 \tag{9}$$

The value 1 of the convergence is here replaced by a function $b(p)$. The
relations (9) are derivable from (7) through commutations of the integrations
over $f$ and the $s_1 \ldots s_r$ and subsequent commutation of the integration
over the $s_1 \ldots s_r$ and transition to the limit. The admissibility of these
commutations is understood.

The assumption is now introduced that the function $w^*_n$ has Bernoulli
properties of a second kind, holding for integration over $p$:

$$k_n(f; p_1, p_2) = \int_{p_1}^{p_2} w^*_n(p; f)\,dp$$

$$\lim_{n \to \infty} k_n(f; p_1, p_2) = k(f) \quad \text{for } p_1 \leq f \leq p_2 \tag{10}$$

$$\lim_{n \to \infty} k_n(f; p_1, p_2) = 0 \quad \text{for } f < p_1 \text{ or } f > p_2$$

The limit function $k(f)$ is assumed nonvanishing and finite. The convergence
relations (10) are derivable from (9) if the functions $b^*_n$ converge smoothly,
i.e., if, for every interval not containing the critical point, the zero convergence
of the area is associated with a uniform zero convergence of the ordinates $w^*_n$,
and if the assumption is added:

$$\lim_{n \to \infty} \int_0^1 w^*_n(p; f)\,dp = k(f) \tag{11}$$

The relations (10) have the following meaning:

If the cross sections $p$ = const. of the function $w^*_n(p;f)$ have the limit properties
of Bernoulli functions for $f = p$ as critical point, the cross sections
$f$ = const. have the same limit properties for $p = f$ as critical point. Note
that the cross sections $p$ = const. have the staircase shape of histograms,
whereas the cross sections $f$ = const. are smooth curves.

The proof of theorem (10) is as follows. Without a specific knowledge of
the function $w^*_n$ we cannot say that the cross sections $p$ = const. have a
maximum at $f = p$; but from the Bernoulli properties (9) it follows that for
increasing $n$ there must arise a maximum close to $f = p$, which converges
toward $f = p$ for $n \to \infty$. The ordinate at the critical point is the critical
ordinate of the curve; it goes to infinite values for $n \to \infty$, whereas every
noncritical ordinate $f \neq p$ goes to 0. For a curve $f$ = const., the ordinate
$p = f$ must be the critical ordinate, too, because it is so for the corresponding
curve $p$ = const., and thus goes to infinite values. Any other ordinate $p \neq f$
of the curve $f$ = const. is, at the same time, a noncritical ordinate of a certain
curve $p$ = const., and thus goes to 0 with increasing $n$. Since, according to (11),
the area between the curve $f$ = const. and the $p$-axis goes to $k$ with increasing
$n$, the first of the relations (10) follows. The second of these relations is then
a consequence.

After these mathematical preparations, I now define Bernoulli sequences 
as follows: 

A sequence of the probability $p$ is a Bernoulli sequence in the wider sense
if both for enumeration with overlapping and for enumeration by sections
there exist probability functions $w_n(p, s_1 \ldots s_r; f)$ such that the relations (7),
(9), and (10) are satisfied.

A lattice of sequences is a lattice of Bernoulli sequences in the narrower
sense, or Bernoulli lattice, if each sequence is a Bernoulli sequence in the wider
sense, and if lattice probability functions $w_n(p, s_1 \ldots s_r; f)$ exist such that
relations (7), (9), and (10) are satisfied.

The definition of the $b_n$ for different modes of enumeration is given by
(1)-(3) in combination with (4)-(6). Note that the functions $b_n$ need not be
the same.

Using the symbol $F^{kn}_{\delta}$ defined in (10c, § 50), we can write the first relation
(7) for lattice counting in the form

$$\lim_{n \to \infty} P(A, F^{kn}_{\delta})^k = 1 \tag{12}$$


According to § 50, this means that the frequency sequences of a Bernoulli 
lattice form a convergent lattice. 

The Bernoulli functions of normal sequences were shown in (20, § 49) and
(25, § 49) to have the properties (6) and (7). It can be proved that they have
the property (11) also. For normal sequences the function $w_n$ (see 7 and 21,
§ 49) is identical with $w^*_n$, since $p$ and $f$ are here the only arguments, and has
the form:

$$w_n(p; f) = (n + 1) \binom{n}{m} p^m (1 - p)^{n-m} \qquad f = \frac{m}{n} \tag{13}$$

This function has the property

$$\int_0^1 w_n(p; f)\,dp = 1 \tag{14}$$

for every $n$, so that here $k = 1$. Formula (14) is proved by integration of the
function (13) over $p$. For this purpose we use an auxiliary formula, known
from the theory of Euler’s integrals; it holds for integers $a$ and $b$, which are
$\geq 0$, and is derivable by repeated use of integration by parts:

$$\int_0^1 x^a (1 - x)^b\,dx = \frac{a!\,b!}{(a + b + 1)!} \tag{15}$$

By the use of this formula, (14) is easily verified. It follows that the limit
properties (10) hold for the Bernoulli functions of normal sequences, in the
form:

$$k_n(f; p_1, p_2) = \int_{p_1}^{p_2} w_n(p; f)\,dp$$

$$\lim_{n \to \infty} k_n(f; p_1, p_2) = 1 \quad \text{for } p_1 \leq f \leq p_2 \tag{16}$$

$$\lim_{n \to \infty} k_n(f; p_1, p_2) = 0 \quad \text{for } f < p_1 \text{ or } f > p_2$$

For normal sequences the three kinds of enumeration lead to the same functions
$b_n$, whereas, of course, the functions $b_n$ and $k_n$ have different mathematical
forms. The amplified Bernoulli theorem, derived in § 51 for normal
sequences, is not included in the definition of Bernoulli sequences and cannot
be derived for such sequences without further presuppositions.
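
The verification of (14) by means of (15), and the limit behavior stated in (16), can be illustrated numerically. The sketch below is an added illustration under stated assumptions: the function names, the choice $f = 1/2$, the intervals, and the midpoint rule used for the integration are all choices made here and do not come from the text.

```python
from fractions import Fraction
from math import comb, factorial

def euler_integral(a, b):
    """Exact value of the integral of x^a (1-x)^b over [0, 1], formula (15)."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

def integral_of_w_over_p(n, m):
    """Exact integral over p from 0 to 1 of w_n(p; f) with f = m/n, formula (13)."""
    return (n + 1) * comb(n, m) * euler_integral(m, n - m)

def w(n, m, p):
    """Bernoulli density (13) of a normal sequence at f = m/n."""
    return (n + 1) * comb(n, m) * p**m * (1 - p)**(n - m)

def k_n(n, m, p1, p2, steps=2000):
    """Midpoint-rule approximation of k_n(f; p1, p2) for f = m/n."""
    h = (p2 - p1) / steps
    return sum(w(n, m, p1 + (j + 0.5) * h) for j in range(steps)) * h

# (14): the integral over p equals 1 for every n and every m.
normalized = all(
    integral_of_w_over_p(n, m) == 1 for n in range(1, 30) for m in range(n + 1)
)

# (16): for f = 1/2 the area concentrates in an interval that contains f
# and vanishes in an interval that does not.
sizes = (10, 100, 400)
inside = [k_n(n, n // 2, 0.4, 0.6) for n in sizes]
outside = [k_n(n, n // 2, 0.6, 0.8) for n in sizes]
print(normalized, inside, outside)
```

The exact check uses rational arithmetic, so that (14) is confirmed without rounding; the values of $k_n$ are only approximations for a few finite $n$ and indicate the direction of the convergence, not the limit itself.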

The question may be asked whether it is possible to characterize the
Bernoulli sequences thus defined in a different manner, for instance, by properties
of phase probabilities or by their behavior in respect to regular divisions.
No comprehensive answer to the question has yet been found. All we know
is that several sequence types belong to the Bernoulli sequences; thus, apart
from normal sequences, sequences with probability transfer and Poisson
sequences are Bernoulli sequences.




Chapter 8 

THEORY OF PROBABILITIES OF 
A HIGHER LEVEL 






§ 58. Probabilities of a Higher Level 

The construction of the probability expressions so far employed represents,
in one respect, a special case. Every probability expression contains the
probability implication only once: the probability implication stands between
terms that are not themselves probability expressions. All the probability
expressions previously used are therefore of the type

$$(A \underset{p}{\ni} B) \tag{1}$$

$A$ and $B$ may be very complicated expressions, but they themselves never
contain a probability implication. In the P-notation this restriction is expressed
by the fact that within a P-symbol

$$P(A,B) = p \tag{2}$$

other P-symbols never occur, that is, no P-symbols are contained in $A$ and $B$.

There are a number of applications that cannot be interpreted by means
of probability expressions of this particular form. We find instances in which
we do not know for certain which probability exists in a given sequence.
We speak, therefore, of probabilities of the second level; they are employed
in probability statements concerning the existence of a probability. The iteration
may be further continued: it is possible to make a probability statement
about the existence of a probability of the second level, so that a probability
of the third level results, and so on. The operations that are carried out with
probabilities of a higher level constitute the theory of probabilities of a higher
level or the theory of the hierarchy of probabilities.

For what problems do we need probabilities of a higher level? The question
is linked to the problem of the origin of probability statements and
therefore leads to considerations such as are given in § 17. In some cases,
for instance, the throwing of dice or the drawing of balls from a bowl, we
know the degree of probability before the sequence is produced; we then
speak of an a priori determination of probability. But a superficial glance
tells us that in such cases we do not have a completely reliable knowledge
about the degree of probability, if only for the reason that the die may possess
an unsymmetrical distribution of weight, or that the mechanism employed
may be constructed inaccurately. Closer inspection teaches us that all cases
of a so-called a priori determination of probability must be analyzed in terms
of the theory of levels of probability.

In other cases we speak of an a posteriori determination of probability; this
is applied when the probability is found by the enumeration of a given sequence.
Since we never observe the whole infinite sequence but only a finite
initial section of it, it will be impossible to know the limit of its frequency
with absolute certainty; so the analysis of the a posteriori determination of
probabilities and thus of the inductive inference requires the theory of levels of
probability. Consequently, the analysis of all probability statements requires
the theory of levels; and, in fact, the theory of probabilities of the first level
so far developed must be regarded as an approximation, which is applicable
when the probabilities of higher levels are very nearly equal to 1. As before,
however, discussion of such epistemological considerations will be postponed,
and the mathematical theory of probabilities of a higher level will be developed
first.

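Since a probability of the first level refers to a sequence, a probability of
the second level leads to a sequence of sequences, and therefore the lattice
is the natural representation of such expressions. To simplify the presentation,
it will be assumed throughout that the elements $z_{ki}$ of $C$ or $\bar{C}$ are identical
with the elements $y_{ki}$ of $B$ or $\bar{B}$, that is, we have an internal probability
implication (see § 9); otherwise we would need two lattices $z_{ki}$ and $y_{ki}$. The
lattice $y_{ki}$ is symbolized in (3):

$$\begin{array}{c|cccccc}
x_1 & y_{11} & y_{12} & y_{13} & \ldots & y_{1i} & \ldots \\
x_2 & y_{21} & y_{22} & y_{23} & \ldots & y_{2i} & \ldots \\
\vdots & \vdots & \vdots & \vdots & & \vdots & \\
x_k & y_{k1} & y_{k2} & y_{k3} & \ldots & y_{ki} & \ldots \\
\vdots & \vdots & \vdots & \vdots & & \vdots &
\end{array} \tag{3}$$

The elements $y_{ki}$ belong to $B$ or $\bar{B}$ and in another classification to $C$ or $\bar{C}$, so
that we have a probability implication

$$(B^{ki} \underset{p_\rho}{\ni} C^{ki})^i \tag{4}$$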

We write $p_\rho$ for the degree of probability in order to indicate that this probability
is not constant for the lattice; the probabilities of the horizontal
sequences differ from sequence to sequence, that is, the lattice is horizontally
inhomogeneous. The implicational form of writing is used because expressions
of a new logical structure are to be developed. Later the results will be
translated into the P-notation.

Since we wish to construct the probability of a probability, the expression (4)
is to represent the second term within a probability implication, and so we
must now explain the first term. For this purpose we have added in (3) the
column $x_1 \ldots x_k \ldots$, the elements of which correspond each to a horizontal
row of the lattice; the elements $x_k$ belong to $A$ or $\bar{A}$. The probability
to be constructed goes from an element $x_k$ belonging in $A$ to the existence
of a probability $p_\rho$ in the corresponding horizontal sequence. Thus we have,
in the detailed notation, the following form for the inquired probability:

$$(k)\,(x_k \in A \underset{q_\rho}{\ni} [(i)\,(y_{ki} \in B \underset{p_\rho}{\ni} y_{ki} \in C)]) \tag{5}$$

For the formal calculus of probability the probability $p_\rho$ is a number coordinated
to a horizontal sequence, and likewise the probability $q_\rho$ is a number
coordinated to a sequence of sequences. In the frequency interpretation (5)
means that among the horizontal sequences are found sequences with the
frequency limit $p_\rho$, the frequency of which, counted vertically, goes in the
limit toward $q_\rho$.
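
The frequency interpretation of the second-level probability can be made concrete with a finite section of such a lattice. The following sketch is an added illustration, not a construction taken from the text: the pairs in `kinds`, the lattice dimensions, the seed, and all function names are hypothetical, and finite sections can only approximate the limits $p_\rho$ and $q_\rho$.

```python
import random

def build_lattice(kinds, rows, cols, seed=0):
    """Return (kind index of each row, lattice of 0/1 entries).

    kinds is a list of pairs (p_rho, q_rho): a horizontal sequence receives
    the probability p_rho with probability q_rho. An entry 1 stands for an
    element y_ki belonging to C.
    """
    rng = random.Random(seed)
    weights = [q for _, q in kinds]
    row_kinds, lattice = [], []
    for _ in range(rows):
        rho = rng.choices(range(len(kinds)), weights=weights)[0]
        p = kinds[rho][0]
        row_kinds.append(rho)
        lattice.append([1 if rng.random() < p else 0 for _ in range(cols)])
    return row_kinds, lattice

def nearest_kind(freq, kinds):
    """Index of the p_rho closest to an observed horizontal frequency."""
    return min(range(len(kinds)), key=lambda rho: abs(kinds[rho][0] - freq))

kinds = [(0.2, 0.3), (0.7, 0.7)]  # hypothetical pairs (p_rho, q_rho)
row_kinds, lattice = build_lattice(kinds, rows=2000, cols=500)

# Horizontal count: estimate the frequency of each row and assign it to a p_rho.
estimated = [nearest_kind(sum(row) / len(row), kinds) for row in lattice]

# Vertical count: relative frequency of rows of each kind among all rows.
q_estimates = [estimated.count(rho) / len(lattice) for rho in range(len(kinds))]
print(q_estimates)
```

The horizontal count yields an approximation of $p_\rho$ for each row, and the vertical count over the rows yields an approximation of $q_\rho$; the two directions of counting remain distinct throughout.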

As in (2 and 3, § 34), we translate (5) into the abbreviated notation by
adding the subscript of the elements as superscript to the class symbols and
repeating the running superscript (the bound variable) outside the parentheses:

$$(A^k \underset{q_\rho}{\ni} (B^{ki} \underset{p_\rho}{\ni} C^{ki})^i)^k \tag{6}$$

The technical relation to (1) is apparent: (6) results from (1) if in (1) the
expression (4) is substituted for $B$ and the superscripts are added. The addition
is necessary because of the lattice form.

We can translate (6) into the P-notation:

$$P(A^k, [P(B^{ki}, C^{ki})^i = p_\rho])^k = q_\rho \tag{7}$$

In analogy to (7) a probability of the third level would be written

$$P(A^l, P(B^{lk}, [P(C^{lki}, D^{lki})^i = p_\rho])^k = q_\rho)^l = r_\rho \tag{8}$$

There are other kinds of probabilities of a higher level, which result when
a probability expression is introduced into the first term of another probability
expression. We then have expressions of the form

$$(A^k.(B^{ki} \underset{p_\rho}{\ni} C^{ki})^i \underset{r_\rho}{\ni} D^k)^k \tag{9}$$

or, in the P-notation,

$$P(A^k.[P(B^{ki}, C^{ki})^i = p_\rho], D^k)^k = r_\rho \tag{10}$$

This expression means that we add to the lattice (3) a vertical column of the
elements $z_k$ and consider the probability from the existence of a horizontal
sequence with the probability $p_\rho$ to the existence of a corresponding $z_k \in D$.
Expressions of this form must be included in the category of probabilities
of a higher level.

Probabilities of a higher level must be carefully distinguished from probabilities
of a higher kind, which were discussed with respect to the Bernoulli
theorem (§ 50). In probabilities of a higher kind there occur within the P-symbol
only frequency expressions referring to a finite number of elements, that is,
to finite sections of sequences or to finite lattice sections.¹ Their existence
can always be verified by the enumeration of a finite number of elements. This
fact is clear from table 5 (p. 279), and holds even for the interpreted conception,
since throughout it is only the highest probability, that is, the probability
of the whole P-symbol, that refers to infinite sequences. Probabilities
of a higher kind are, therefore, probabilities of the first level; they contain
within the P-symbol only expressions of a finite reference. Probabilities of the
second level, however, contain within the P-symbol expressions of an infinite
reference, since the probability occurring inside the symbol refers to an infinite
sequence and therefore can be verified only after the enumeration of an
infinite sequence.

The transition to probabilities of the second level finds a technical expression
in the fact that the lattices are horizontally inhomogeneous, whereas in
the preceding chapters they were always homogeneous in the horizontal
direction. It is true that we sometimes deal with vertically inhomogeneous
lattices; and since vertically homogeneous lattices will be used occasionally
in the following considerations, there would be no difference in such cases,
since the two lattice directions are mathematically equivalent. For convenience
in notation, however, a distinction will be made between horizontally inhomogeneous
and vertically inhomogeneous lattices. The sequences that,
taken as wholes, are elements of a frequency enumeration of sequences will
always be written as horizontal sequences. On this condition, only the horizontally
inhomogeneous lattice represents the transition to probabilities of
the second level, whereas the horizontally homogeneous lattice belongs to
probabilities of the first level. The distinction means no more than the convention
not to ask questions concerning the frequency of vertical sequences
with respect to the last-mentioned lattice.


¹ Only in § 51 did we deal with a probability of a different type, since the probability $c_{n\delta}$
considered there is a probability of a higher level.


§ 59. Constants in Probability Expressions

The introduction of probabilities of a higher type requires an extension of
the calculus that will be developed first in the implicational notation. It will
be necessary to introduce into probability statements expressions that do
not depend on a subscript or a superscript and, therefore, represent constants.
The introduction is possible if the constant is regarded as a function of the
variable $i$ having the same value for all $i$, a conception that is familiar from
mathematical constants. As an illustration of a probability expression containing
a constant, consider the probability that, when a die is thrown, the
“6” shows up and Ohm’s law holds for electrical currents. This probability
has the same value as the probability of the occurrence of face 6 alone, since
Ohm’s law is always valid. When $D$ is the constant, this probability has the
form

$$(A^i \underset{p}{\ni} B^i.D)^i \tag{1}$$

A similar form results if the expression introduced contains a variable $k$, but
does not contain the bound variable $i$ of the probability expression, as in

$$(A^k.B^{ki} \underset{p}{\ni} C^{ki})^i \tag{2}$$

Here $A^k$ plays the same role as $D$ in (1). Furthermore, the expression introduced
may contain a bound superscript:

$$(A^i.(B^k)^k \underset{p}{\ni} C^i)^i \tag{3}$$

Here $(B^k)^k$, that is, $(k)(y_k \in B)$, has the meaning of a constant. It is even
possible that one term of the probability expression does not contain a variable
superscript, so that there result expressions of the form

$$(D \underset{p}{\ni} B^i)^i \tag{4a}$$

$$(A^i \underset{p}{\ni} D)^i \tag{4b}$$

The variable superscript is given by the other term, and the constant term
has the value $D$ for all $i$.

For operations with constants in probability expressions no additional
axioms are required; the rules of operation follow from the axioms of the
calculus in combination with the general rules of symbolic logic if such
expressions as (1)-(4) are admitted as meaningful.

For the interpretation of expressions like (1), (4a), (4b), we must realize
that capital letters in probability symbols represent classes, not propositions.
According to a remark made at the end of § 7, however, we can speak of the
extension of a proposition; it is either the universal class or the null class,
depending on whether the proposition is true or false. This interpretation
follows because if $d$ is a proposition we regard the expression $(x)d$ as meaningful
and equivalent to $d$. The constant $D$ in (1), (4a), or (4b), therefore,
represents the universal class or the null class, whatever the proposition $d$
may mean. Since the transition from the class symbol to the propositional
symbol is made, in the notation used in this work, by the addition of parentheses,
the proposition $d$ is to be written in the form $(D)$, a form that means
originally $(i)D$ and thus is the same as $d$. But if the constant term inside the
parentheses is originally a proposition, like the term $(B^k)^k$ in (3), such terms
will be considered to represent the corresponding class, that is, the universal class
or the null class. That in such cases the notation does not supply an independent
distinction between propositions and classes, making this distinction
dependent on the occurrence of the expression within parentheses, seems without
danger because of the isomorphism between the calculus of propositions
and that of classes, although, of course, such a notation would not be expedient
for other purposes.

Consider the expression (4b). The proposition $d$, or $(D)$, coordinated to
the class $D$, will be either true or false; correspondingly, we have either
$(A^i \supset D)^i$ or $(A^i \supset \bar{D})^i$, a disjunction that expresses merely the tertium non datur.
Applying axiom II,1 and (9, § 13), we thus derive

$$(A^i \underset{p}{\ni} D)^i \supset [(p = 1) \lor (p = 0)] \tag{5}$$

The alternative $p = 1$ holds if $(D)$ is true; the alternative $p = 0$ holds if $(D)$
is false. Furthermore, we have the tautology $(D \supset (A^i \supset D)^i)$, which is a generalization
of the logical formula (8c, § 4). By means of (1, § 25) we thus
derive

$$(D \supset [(A^i \underset{p}{\ni} B^i)^i \equiv (D.A^i \underset{p}{\ni} B^i)^i]) \tag{6}$$

Because of $(D \supset (A^i \supset D)^i)$ we derive from (4, § 25)

$$(D \supset [(A^i \underset{p}{\ni} B^i)^i \equiv (A^i \underset{p}{\ni} D.B^i)^i]) \tag{7}$$

Putting in (6) the special value $(A^i)^i$ for $D$, we obtain the expression $(A^i)^i.A^i$
on the right side; when we apply (1, § 25) a second time, we see that this
expression may be replaced by $(A^i)^i$, or $(A^m)^m$. We thus derive

$$((A^i)^i \supset [(A^i \underset{p}{\ni} B^i)^i \equiv ((A^m)^m \underset{p}{\ni} B^i)^i]) \tag{8}$$

If the sequence $A$ is compact, we may regard it as a constant if it occurs in
the first term of the probability expression.

With (67 and 7a, § 4) and the help of some simple transformations, we
derive from (6) the two formulas

$$(D).(A^i \underset{p}{\ni} B^i)^i \supset (D.A^i \underset{p}{\ni} B^i)^i \tag{9}$$

$$(D).(A^i \underset{p}{\ni} B^i)^i \equiv (D).(D.A^i \underset{p}{\ni} B^i)^i \tag{10}$$

To (9) we can add

$$(\bar{D} \supset (D.A^i \underset{p}{\ni} B^i)^i) \tag{11}$$

This follows from (7, § 12). Formulas (9) and (11) together give

$$(A^i \underset{p}{\ni} B^i)^i \supset (D.A^i \underset{p}{\ni} B^i)^i \tag{12}$$

The truth of the formula is easily seen when we consider the meaning of $D$:
if $D$ is the universal class, its addition on the right side does not change the
reference class $A$; if $D$ is the null class, the reference class on the right side
is empty and the corresponding probability may have any value, so that the
implication holds. By similar considerations the meaning of other formulas
may easily be explained.

In formulas (9), (11), (12) the expression $(D.A^i \underset{p}{\ni} B^i)^i$ is implied by a
stronger expression; in (6) its equivalence to another expression is linked to
a condition. We now wish to find an expression that represents an unrestricted
equivalence to $(D.A^i \underset{p}{\ni} B^i)^i$. It is supplied by the formula

$$(D \supset (A^i \underset{p}{\ni} B^i)^i) \equiv (D.A^i \underset{p}{\ni} B^i)^i \tag{13}$$

The implication from left to right contained in the equivalence sign is proved
by transforming the implication written on the left side into $(\bar{D} \lor (A^i \underset{p}{\ni} B^i)^i)$;
since both terms of the disjunction imply the right side of (13) according to
(11) and (12), we derive that the implication holds. The implication from
right to left is inferred from (10) by the use of the implication from right to
left contained in the equivalence sign of (10), the dropping of $D$ on the left
side, and the application of (6d, § 4).

An important special case of (13) obtains if $D$ is replaced by $(A^m \underset{p}{\ni} B^m)^m$.
Then the left side of (13) becomes a tautology, and therefore the right side
of (13), taken alone, also represents a formula that is always true:

$$(A^i.(A^m \underset{p}{\ni} B^m)^m \underset{p}{\ni} B^i)^i \tag{14}$$

Here the existence of the probability is incorporated in the first term as
condition, so that a tautology results.

Formula (13) may be regarded as an extension of the logical formula

$$a \supset (b \supset c) \equiv a.b \supset c \tag{15}$$

In (13) the second implication sign on the left side of (15) is replaced by a
probability implication, and consequently the implication on the right side
of (15) also is replaced by a probability implication. We may ask whether a
further extension can be constructed for the case in which the first implication
sign on the left side of (15), too, is replaced by a probability implication.
The question leads us back to probabilities of the second level (see 6, § 58),
and will be discussed in § 60.
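
The logical formula (15), on which the extension (13) is modeled, can be confirmed by running through all truth values. The sketch below is an added illustration; the function names are hypothetical, and it covers only the two-valued formula (15), not the probability implication of (13).

```python
from itertools import product

def implies(x, y):
    """Material implication: false only when x is true and y is false."""
    return (not x) or y

def exportation_holds():
    """Check that a ⊃ (b ⊃ c) and a.b ⊃ c agree for all truth values."""
    return all(
        implies(a, implies(b, c)) == implies(a and b, c)
        for a, b, c in product([False, True], repeat=3)
    )

print(exportation_holds())
```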

The formulas of the present section cannot be expressed exhaustively in
the P-notation, since they contain logical symbols outside the parentheses
of the probability implication. Formulating the meaning of the symbols in
words, the most important formulas for the P-notation are

$$P(A^i, D)^i = \begin{cases} 1 & \text{for } (D) \\ 0 & \text{for } (\bar{D}) \end{cases} \tag{5'}$$

If $(D)$ is true, we have

$$P(A^i, B^i)^i = P(A^i.D, B^i)^i \tag{6'}$$

$$P(A^i, B^i)^i = P(A^i, B^i.D)^i \tag{7'}$$

Formula (13) cannot be written in this manner, but the formula will be used
so far as we introduce permission to write the equivalent expression
$P(A^i.D, B^i)^i$ in place of the expression occurring on the left side.
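
The behavior of a constant in (5'), (6'), and (7') can be followed on a finite section by counting relative frequencies. The sketch below is an added illustration under stated assumptions: the lists `A` and `B` are hypothetical data, the function names are chosen here, and a finite relative frequency merely stands in for the limit of the frequency.

```python
from fractions import Fraction

def relative_frequency(reference, attribute):
    """Frequency of the attribute among the elements of the reference class.

    Returns None when the reference class is empty, the case in which
    (7, § 12) allows the probability to have any value.
    """
    selected = [a for r, a in zip(reference, attribute) if r]
    if not selected:
        return None
    return Fraction(sum(selected), len(selected))

def constant(value, length):
    """A constant D, regarded as a function of i with the same value for all i."""
    return [value] * length

# Hypothetical finite section: membership of x_i in A and of y_i in B.
A = [True, True, False, True, True, True]
B = [True, False, True, True, False, True]

for d in (True, False):
    D = constant(d, len(A))
    A_and_D = [a and x for a, x in zip(A, D)]
    B_and_D = [b and x for b, x in zip(B, D)]
    print(
        d,
        relative_frequency(A, D),        # (5'): 1 or 0
        relative_frequency(A_and_D, B),  # (6'): unchanged if (D) is true
        relative_frequency(A, B_and_D),  # (7'): unchanged if (D) is true
    )
```

When $D$ is the null class, the reference class $A.D$ is empty and the function returns `None`; this corresponds to the indeterminate probability used in the argument for (12).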


§ 60. Operations with Probabilities of the Second Level 

Operations with probabilities of the second level are determined entirely by
the rules of operations with probabilities of the first level, when the rules for
constants in probability expressions are added. Probabilities of the first level
occurring inside probability expressions of the second level, taken as wholes,
are treated in the same manner as other expressions, and are combined according
to the rules of symbolic logic with one another or with other expressions.
If the probability expression of the first level contains a free variable $k$, as in
the expression $(B^{ki} \underset{p_\rho}{\ni} C^{ki})^i$ in (9, § 58) or the expression $P(B^{ki}, C^{ki})^i = p_\rho$ in
(10, § 58), it is regarded as the class of all situations $x_k$ to which the expression
applies. If it contains no free variable, like the expression $(A^m \underset{p}{\ni} B^m)^m$ in
(14, § 59), it is treated according to the rules for constants.

That we do not introduce new axioms for the treatment of probabilities
of a higher level leads to the consequence, however, that the possible operations
are rather restricted. In probability expressions like (6, § 58), the superscripts
$i$ and $k$ indicate two different directions of counting, and no inferences
exist that would transform a probability enumerated horizontally into a
probability enumerated vertically. The absence of such inferences is made
obvious by the frequency interpretation: in dealing with lattice formation
(§ 34), we pointed out that the limits of frequencies in the horizontal direction
do not determine those in the vertical direction. The convergent lattice was
offered as an instance in which the limits in the vertical direction are different
from those in the horizontal direction. As explained on p. 280, we can imagine
an even more general lattice, in which, for instance, all elements on the left
of the lattice diagonal are given as $B$, and normal sequences are added on
the right in the horizontal direction. Then all vertical sequences possess the
frequency limit 1; all horizontal sequences, however, have a frequency limit $p$.
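
The lattice just described can be imitated in a finite section. The following sketch is an added illustration with hypothetical names and values: the size of the section, the probability `p`, the seed, and the use of a pseudorandom generator in place of a normal sequence are all assumptions made here. In a finite section the horizontal frequency of row $k$ comes close to $p$ only when the number of columns is large compared with $k$.

```python
import random

def element(k, i, p, rng):
    """Entry y_ki: B (coded as 1) on the left of the diagonal,
    otherwise a random draw that yields B with probability p."""
    if i < k:
        return 1
    return 1 if rng.random() < p else 0

def build_lattice(size, p, seed=0):
    """Square section of the lattice with rows k and columns i."""
    rng = random.Random(seed)
    return [[element(k, i, p, rng) for i in range(size)] for k in range(size)]

size, p = 1200, 0.3  # illustrative values
lattice = build_lattice(size, p)

row = lattice[5]                               # a horizontal sequence
column = [lattice[k][5] for k in range(size)]  # the vertical sequence with the same index

row_frequency = sum(row) / size
column_frequency = sum(column) / size
print(row_frequency, column_frequency)
```

The horizontal frequency stays near $p$, whereas the vertical frequency is close to 1, because every row below the diagonal contributes an element $B$ to the column.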

The restriction of possible inferences is apparent when we investigate the
question raised at the end of § 59. With the use of (3, § 25) and (10, § 59)
we can transform (6, § 58) into

$$(A^k \underset{q_\rho}{\ni} (B^{ki} \underset{p_\rho}{\ni} C^{ki})^i)^k \equiv (A^k \underset{q_\rho}{\ni} A^k.(B^{ki} \underset{p_\rho}{\ni} C^{ki})^i)^k \equiv (A^k \underset{q_\rho}{\ni} (A^k.B^{ki} \underset{p_\rho}{\ni} C^{ki})^i)^k \tag{1}$$

But this is all that can be achieved without further assumptions; the two
probabilities $p$ and $q$ cannot, in general, be combined into one probability,
because they refer to frequencies counted in different directions.

Only when we consider more special cases is it possible to make inferences
from one direction to the other. In earlier sections we studied more special
lattices possessing determinate probabilities in both directions. In such lattices
the direction of the enumeration of those probabilities of the first level
that are counted vertically coincides with the direction of the enumeration
of probabilities of the second level. In this case some conclusions can be
drawn: we are able to infer relations holding between vertical probabilities
of the first level and probabilities of the second level; we can also make
inferences from horizontal probabilities of the first level to vertical probabilities
of the second level. Such inferences are possible when certain relations
between the two directions of enumeration are assumed.

We introduce, first, the simplifying assumption that the sequence A and 
the lattice B are compact, that is, all x k belong to A and all y k i belong to B. 
This simplification does not restrict the generality, since it can always be 
carried through by a suitable choice of the enumeration. For the P-notation, 
however, a considerable advantage is thus achieved. 1 Assume, furthermore, 
the presence of horizontal sequences in the lattice to which belong the hori¬ 
zontal probabilities pi . . . p T in respect to C. We now define a subclass B p 
of B by the condition that it include those y ki that belong to a horizontal 
sequence in which the probability of C is = p p . A horizontal sequence of the 
y ki of this kind will be called a C-scquence of the kind B p . In symbols 

(y kl e B p ) = Df (y ki 6 B) . (i)(y ki e B € C) (2) 

p P 

¹ In the noncompact lattice B, complications will result if there are horizontal rows in
which no element B occurs. According to (7, § 12) the probability $P(B^{ki}, C^{ki})^i$ has every
numerical value for such rows; therefore we must count such rows in enumerating the
probability of the second level, and we must do so for every $p_\rho$. These conditions do not
correspond to what we want to measure, and they create difficulties for the calculation, since
the cases $B_\rho$ and $B_\sigma$ would not be exclusive for $\rho \neq \sigma$.
$$ B_\rho^{ki} =_{Df} B^{ki} \,.\, (B^{ki} \underset{p_\rho}{\ni} C^{ki})^i \tag{2'} $$

Since the sequence B is compact, we have 

$$ B_\rho^{ki} \equiv (B^{ki} \underset{p_\rho}{\ni} C^{ki})^i $$


With (4, § 25) we now derive 

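$$ (A^k \underset{q_\rho}{\ni} (B^{ki} \underset{p_\rho}{\ni} C^{ki})^i)^k \equiv (A^k \underset{q_\rho}{\ni} B^{ki} \,.\, (B^{ki} \underset{p_\rho}{\ni} C^{ki})^i)^k \equiv (A^k \underset{q_\rho}{\ni} B_\rho^{ki})^k \tag{3} $$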


By means of this transformation the probability of the second level has
assumed the form of a probability of the first level in which a free variable i
occurs; a formula of this kind is valid for every i. Formula (3) enables us to
give a simple representation of the probability of the second level in the
P-notation:

$$ P(A^k, B_\rho^{ki})^k = q_\rho \tag{3'} $$


This formula means that in every vertical column i the probability of finding
an element $y_{ki}$ that belongs to $B_\rho$ is $= q_\rho$. This probability is the same for
all columns i because, if one element of a horizontal sequence belongs to $B_\rho$,
so do all. Note that for this reason formula (3') is not true if the superscript i
is chosen as a running superscript. We rather have, because of (2'),

$$ P(A^k, B_\rho^{ki})^i = \text{either 1 or 0} \tag{3''} $$

depending on whether the horizontal sequence k is of the kind $B_\rho$. However, a
simplification results for the horizontal sequences. Because of (14, § 59) and
with the use of (2) we can write

$$ (B_\rho^{ki} \underset{p_\rho}{\ni} C^{ki})^i \tag{4} $$

or, in the P-notation,

$$ P(B_\rho^{ki}, C^{ki})^i = p_\rho \tag{4'} $$


The expressions are true for every value k because, if k denotes a sequence
for which the horizontal probability is not $p_\rho$, the reference class is empty
and thus any value of the horizontal probability may be asserted. Both (4)
and (4') are probabilities of the second level, since they contain, after the
elimination of the abbreviation $B_\rho$ according to (2), a probability of the first
level in the first term. The character of a probability of the second level may be
indicated by the Greek subscript in $B_\rho^{ki}$, in contrast to the Latin subscripts
used previously. In the following formulas we can return to the use of the
P-notation.

In order to make inferences possible, we assume, not only that the horizontal
and vertical probabilities $p_\rho$ and $q_\rho$ exist, but also that they satisfy
the two conditions:

a. The disjunction $B_1 \vee \ldots \vee B_r$ is complete, that is,

$$ \sum_{\rho=1}^{r} P(A^k, B_\rho^{ki})^k = \sum_{\rho=1}^{r} q_\rho = 1 \tag{5} $$

That the disjunction is exclusive follows from the univocality of the probability
implication.

b. The sublattice of the C-sequences of the kind $B_\rho$ is homogeneous for
every $\rho$, so that

$$ P(B_\rho^{ki}, C^{ki})^i = P(B_\rho^{ki}, C^{ki})^k = p_\rho \tag{6} $$


Assumption b is the decisive assumption, enabling us to make an inference
that reaches beyond (1). We obtain first, since $(B_\rho^{ki} \supset A^k)^k$ because of the
compact character of the sequence A,

$$ P(A^k . B_\rho^{ki}, C^{ki})^k = P(B_\rho^{ki}, C^{ki})^k = P(B_\rho^{ki}, C^{ki})^i = p_\rho \tag{7} $$

Furthermore, from the rule of elimination (21, § 19) we derive

$$ P(A^k, C^{ki})^k = \sum_{\rho=1}^{r} P(A^k, B_\rho^{ki})^k \cdot P(A^k . B_\rho^{ki}, C^{ki})^k \tag{8a} $$

$$ = \sum_{\rho=1}^{r} P(A^k, B_\rho^{ki})^k \cdot P(B_\rho^{ki}, C^{ki})^k \tag{8b} $$

$$ = \sum_{\rho=1}^{r} P(A^k, B_\rho^{ki})^k \cdot P(B_\rho^{ki}, C^{ki})^i \tag{8c} $$

$$ = \sum_{\rho=1}^{r} q_\rho p_\rho = p \tag{8d} $$




The formula contains probabilities of the second level on its right side,
and a probability of the first level on the left side. Thus it represents a relation
between probabilities of different levels. The relation is possible because
the probability expression $B_\rho^{ki}$, which occurs within the P-symbols on the
right side, is eliminated according to the rule of elimination. In the form (8c)
the formula represents a relation between probabilities counted in different
directions; this relation is possible because a corresponding relation is given
in the formula (6).
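As a numerical illustration of (8d), the following Python sketch computes the mean value p from a set of pairs $q_\rho$, $p_\rho$ and compares it with the frequency of C in one vertical column of a simulated lattice. The values of $q_\rho$ and $p_\rho$, the function names, and the simulation itself are assumptions made for the illustration and are not taken from the text: each row is assigned its kind independently with probability $q_\rho$, and its element in the column is then a C with probability $p_\rho$.

```python
from fractions import Fraction
import random


def mean_probability(q, p):
    """Formula (8d): p = sum over rho of q_rho * p_rho."""
    if sum(q) != 1:
        raise ValueError("condition (5) violated: the q_rho must sum to 1")
    return sum(q_r * p_r for q_r, p_r in zip(q, p))


def simulate_column(q, p, rows, rng):
    """Relative frequency of C in one vertical column of a simulated lattice."""
    weights = [float(x) for x in q]
    # Assign each row k its kind rho with probability q_rho.
    kinds = rng.choices(range(len(q)), weights=weights, k=rows)
    # The column element of a row of kind rho is a C with probability p_rho.
    hits = sum(1 for kind in kinds if rng.random() < float(p[kind]))
    return hits / rows


q = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]   # illustrative q_rho
p = [Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)]  # illustrative p_rho

p_mean = mean_probability(q, p)
column_frequency = simulate_column(q, p, rows=100_000, rng=random.Random(0))
print(p_mean, column_frequency)
```

The horizontal frequencies of such a lattice stay near the individual $p_\rho$; only the vertical count approaches the mean value.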
Returning to the implicational notation, we can write the result in the form:
if the conditions (5) and (6) are satisfied, the formula holds:

$$ (\rho)\,[(A^k \underset{q_\rho}{\ni} (B^{ki} \underset{p_\rho}{\ni} C^{ki})^i)^k] \supset (\exists p)(i)\,(A^k \underset{p}{\ni} C^{ki})^k \,.\, \Big(p = \sum_{\rho=1}^{r} q_\rho p_\rho\Big) \tag{9} $$

The bound variable $\rho$ runs through the values 1 . . . r given in (5). Comparing
(9) with (1), we see that, on the conditions stated, two probability
implications arranged in series can be replaced by one probability implication
the degree of which is given by a mean value constructed from all the probabilities
involved. Without the conditions mentioned, however, no such replacement
is possible. For the clarification of formula (9) it may be remarked
that the all-operators referring to $\rho$ and i are added because the formula
cannot be written with the use of the free variables $\rho$ and i.² The operators
express the facts that only when the left side is true for all values $\rho$ does it
imply the right side, and that the value p of (8d) holds for all values i, that is,
for all columns.

The result (8) bears some resemblance to the result (26, § 56) obtained for
the sequences defined by (23-25, § 56), if we write in (8) S\ K instead of $B_\rho$ and
B for C; the latter sequences differ from (8) only in the fact that all the
P(A,S X ') are equal. Yet the two expressions are fundamentally different. The
symbol S\ K denotes an attribute that is attached to all elements of the subsequence,
or, in lattice enumeration, to all elements of the lattice sequence. In
the illustration given for the assumptions (23-25, § 56), S\* represents the
property of being drawn from the bowl S\ K , an attribute that can be verified
directly for every element of the sequence. The sign $B_\rho$, however, does not
represent an attribute that an individual element $y_{ki}$ possesses in virtue of an
individual characteristic. In order to find out whether $y_{ki}$ belongs to $B_\rho$ we must
know the sequence of elements $y_{ki}$ (i running subscript) completely and we
must know also the probability that exists in it with respect to C. This meaning
of $B_\rho$ is seen from its definition in (2). Therefore, the classification $B_\rho$ is a
characterization with infinite means, whereas the classification S\ K represents a
characterization by finite means. This analysis corresponds to the distinction
between probabilities of the first and of the second level, which was formulated
in § 58 in the statement that the first contain only finite expressions
in the P-symbol, whereas the latter include infinite expressions in the P-symbol.

² If we omit the all-operators and write $\rho$ and i as free variables, we can put all-operators
referring to $\rho$ and i before the whole formula, according to the rule for free variables; but the
formula thus resulting does not have the meaning of (9). This follows from the formula
(11d, § 25) given in ESL, p. 135; see also pp. 107 and 144 of that book. Because of the
restriction of the values $\rho$ to the values 1 . . . r given in (5), the operator referring to $\rho$ is a
restricted operator (see ESL, p. 162).
The resemblance of the two kinds of probability has an obvious reason:
if the classification of the elements of a sequence by simple observational
criteria leads to subsequences of such a kind that the probability of a certain
attribute within the subsequence is known, the probability of the observational
criterion in the major sequence is translatable into the probability of a
subsequence of a certain probability and thus into a probability of a higher
level. But it is an intrinsic difference whether the definition of the subsequence
is given in terms of an observational criterion of its elements or in terms of the
probability of the subsequence as a whole. Only in the latter case do we speak
of a probability of a higher level. Formally speaking, however, we may use
the symbol $B_\rho$ like a class symbol representing a classification by finite means.
But the theory of such probabilities cannot be constructed without a lattice.
Only if S\ K is an individual attribute are we able to say that an element of the
major sequence belongs to the subsequence S\ K . If we omitted the indication
Su for the elements of the major sequence (23-25, § 56), we would not know
into which subsequence a given element was to be incorporated. Therefore,
we are always dependent on lattice formation for probabilities of the second
level.

Probabilities of the second level are distinguished from those of the first
level in a further sense. That we are able to interpret the mean value p in (8d)
as a probability, namely, as a probability in the vertical direction,³ derives
from the special condition (b) assumed for the lattice. The bowl schema is
representative of such a lattice only when the drawing from a bowl satisfies
the condition (6), which it usually does. Otherwise we are confronted by a
lattice of a different kind, in which the mean value p, computed according
to (8d), cannot be interpreted as a probability; and the two probabilities of

³ We might suppose that it is possible to interpret the value p as a probability in another
sense if we count the two-dimensional lattice in a one-dimensional arrangement, for instance,
in the enumeration

$$ z_{11}\; z_{12}\; z_{22}\; z_{21}\; z_{31}\; z_{32}\; z_{33}\; z_{23}\; z_{13}\; z_{14} \ldots $$

But from the assumptions mentioned we cannot derive that such an enumeration must
result in the limit p for the frequency. This impossibility is easily demonstrated by the
construction of an opposite instance: draw the lattice lines running through $z_{11}\, z_{23}\, z_{35} \ldots$ and
$z_{11}\, z_{32}\, z_{53} \ldots$ respectively, and put C's in the places of all the $z_{ki}$ lying between the lines;
then all the vertical and horizontal sequences can be supplemented so as to form normal 
sequences, in the narrower sense, with the frequency limit p< p Then we can easily infer 
that the specified enumeration results in a frequency limit $> p$, since the C-elements in the
inner sector always amount to approximately one-half of all elements. Conclusions in regard
to the frequency resulting from the enumeration can be drawn only if we introduce assumptions
about a uniform convergence of all the horizontal sequences, that is, if for every $\delta$
there exists an n such that for all k

$$ | p - F^n(B^{ki}, C^{ki})^i | < \delta $$

holds. But we do not wish to introduce such assumptions (see § 65).

If it should be found objectionable, by the way, that normal sequences in the narrower
sense admit of “abnormal” lattices like the one mentioned, the definition of normal sequences
in the narrower sense can be further restricted by suitable conditions, for instance, the
condition that the same limit of the frequency should exist in every lattice line, that is, in every
straight line drawn through the points of the lattice.
different levels occurring in a probability expression of the second level like
(f>, § 58) cannot be combined in one probability. Here we must be satisfied
with the statement of a pair $q_\rho$, $p_\rho$, or, when $\rho$ is varied, of a series of such pairs.

But even when the mean value of this pair-series can be interpreted as a
probability, it does not provide a complete substitute for the pair-series.
Only for the vertical direction does it represent a probability; for the characterization
of the horizontal sequences two probabilities, or a pair-series of
probabilities, are always required. Probability expressions of the second level
are always characterized by a pair, or by a pair-series, of degrees of probability.
This necessity is clearly expressed in (1): if we know that A and B
exist, what we wish to state about C can be expressed only by the two probabilities
$p_\rho$ and $q_\rho$. If it is true that we look forward to a chance event with a
definite feeling of expectation, the intensity of which stands in a certain
relation to the magnitude of the probability, we must admit that for probabilities
of the second level we have to adjust our feeling of expectation in two
different directions.

Such a double adjustment, in fact, is recognizable in our behavior. If, 
before a horse race, a well-versed expert of the sport tells us that the winning 
chances of the favorite amount to 80%, and another racing fan, more enthusiastic
than expert, claims the same probability for the victory of the favorite,
we shall evaluate the two identical statements differently: we place more 
trust in the statement of the expert. This means that his statement has a 
higher probability of the second level. It is obvious, however, that the higher 
probability of the second level is not expressible by a change in the probability
of the first level: we must not assume the probability of the victory
of the favorite as smaller or greater if our only basis is the information given 
by the inexperienced fan. What is smaller is solely the probability that the 
probability of 80% is correct. If we have no better information, we should 
rather refrain from betting than bet on the basis of a value other than 80%. 

§ 61. The Dispersion in a Horizontally Inhomogeneous Lattice

The mean value calculated in (8d, § 60)

$$ p = \sum_{\rho=1}^{r} q_\rho p_\rho \tag{1} $$

cannot be interpreted as a probability in the horizontal direction. However,
for this direction, too, p is a quantity of mathematical import if the dispersion
is calculated.

Let $\delta f^{kn}$ be determined in analogy to (12a, § 52) by

$$ \delta f^{kn} = f^{kn} - p \tag{2} $$

Then $\Delta^2(f^n)$ is defined as in (13a, § 52) by

$$ \Delta^2(f^n) = M(\delta^2 f^{kn})^k \tag{3} $$

Although, for a lattice that is homogeneous in the horizontal direction, $\Delta(f^n)$
approaches 0 with increasing n, as can be seen from the theoretical value
(15, § 52), this is no longer true of horizontally inhomogeneous lattices. For
the limit $n \to \infty$ there appear deviations

$$ \delta_\rho = p - p_\rho \tag{4} $$

of the frequencies of the C-sequences of the kind $B_\rho$; therefore we have, for
the limit $n \to \infty$,

$$ \Delta^2(f^\infty) = \sum_{\rho=1}^{r} P(A^k, B_\rho^{ki})^k \cdot \delta_\rho^2 = \sum_{\rho=1}^{r} (p - p_\rho)^2\, q_\rho \tag{5} $$


The mean value p defined in (1) is distinguished from all other values by the
fact that $\Delta(f^\infty)$ becomes a minimum if this value is taken as the point of
reference; for any other choice of p, formula (5) leads to a larger $\Delta(f^\infty)$.
This follows from (12, § 37).

Therefore, replacement of the frequencies of horizontal sequences of an 
inhomogeneous lattice by the mean value p has a certain meaning: by this 
replacement we commit the smallest mean error. 
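The minimum property can be checked with exact arithmetic. The sketch below, written in Python with illustrative values of $q_\rho$ and $p_\rho$ that are assumptions and not taken from the text, evaluates the sum $\sum q_\rho (x - p_\rho)^2$ of formula (5) for the mean value p and for shifted reference points x = p + d; the helper name `limiting_dispersion_sq` is likewise hypothetical.

```python
from fractions import Fraction


def limiting_dispersion_sq(q, p, reference):
    """Sum over rho of q_rho * (reference - p_rho)^2, as in formula (5)."""
    return sum(q_r * (reference - p_r) ** 2 for q_r, p_r in zip(q, p))


q = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]   # illustrative q_rho
p = [Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)]  # illustrative p_rho
p_mean = sum(q_r * p_r for q_r, p_r in zip(q, p))

at_mean = limiting_dispersion_sq(q, p, p_mean)

# Shifting the point of reference by d adds exactly d^2 to the squared
# dispersion, so every reference point other than the mean gives a larger value.
shifts = (Fraction(-1, 4), Fraction(1, 10), Fraction(1, 3))
excess = [limiting_dispersion_sq(q, p, p_mean + d) - at_mean for d in shifts]
print(at_mean, excess)
```

The excess over the minimum depends only on the shift d, not on the individual $p_\rho$.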

The dispersion calculated with the mean value p becomes supernormal for
larger n; this follows from the fact that it does not go toward 0 with increasing
n. But if we want to infer the inhomogeneous character of the lattice from
the dispersion, it is not sufficient to ascertain its supernormal character, since 
a supernormal dispersion may occur likewise in homogeneous lattices, for 
instance, in sequences with probability drag. Besides this result, we must 
therefore investigate statistically whether the dispersion approaches zero with 
increasing n. This is feasible to the same degree as the ascertainment of limits 
in general. For the mean value p we may take, for instance, the mean value 
of the frequencies in the vertical sequences. The investigation of the dispersion 
thus presents a method by which we can conclude whether the lattice is 
inhomogeneous. 

The dispersion of the total lattice can be calculated for finite n if we assume
that for each $\rho$ the sublattice of the C-sequences of the kind $B_\rho$ is normal in
the narrower sense. Considering first one of these sublattices we obtain, with
(12, § 37) and (15, § 52), as its dispersion referred to p,

$$ \Delta_\rho^2(f^n) = \frac{p_\rho (1 - p_\rho)}{n} + (p - p_\rho)^2 \tag{6} $$
The dispersion of the total lattice, likewise referred to p, is the average of
the dispersions of the sublattices, that is,

$$ \Delta^2(f^n) = M(\Delta_\rho^2(f^n))_\rho = \sum_{\rho=1}^{r} q_\rho \left[ \frac{p_\rho (1 - p_\rho)}{n} + (p - p_\rho)^2 \right] \tag{7} $$

With the help of (1) this result can be transformed, by intermediate calculations, into

$$ \Delta^2(f^n) = \frac{p(1 - p)}{n} + \frac{n - 1}{n} \sum_{\rho=1}^{r} q_\rho (p - p_\rho)^2 \tag{8} $$

The expression approaches the limit (5) for increasing n; for every finite n
the value $\Delta(f^n)$ is greater than $\Delta(f^\infty)$ and, at the same time, is greater than
the normal dispersion calculated with reference to p according to (15, § 52).
For more general types of lattices, the relation (8) is to be replaced by a
different relation; but all such formulas must correspond to (8) so far as they
must become identical with (5) for the limit $n \to \infty$.
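The intermediate calculation can be checked numerically. The Python sketch below compares the average of the sublattice dispersions, $\sum q_\rho [p_\rho(1 - p_\rho)/n + (p - p_\rho)^2]$, with the closed form $p(1-p)/n + \frac{n-1}{n}\sum q_\rho (p - p_\rho)^2$. This closed form was worked out here from (7) and (1) and is an assumption about the shape of (8); the values of $q_\rho$, $p_\rho$ and the function names are illustrative.

```python
from fractions import Fraction


def dispersion_sq_by_average(q, p, n):
    """Average over rho of p_rho(1 - p_rho)/n + (p - p_rho)^2, as in (7)."""
    mean = sum(a * b for a, b in zip(q, p))
    return sum(a * (b * (1 - b) / n + (mean - b) ** 2) for a, b in zip(q, p))


def dispersion_sq_closed(q, p, n):
    """Closed form: p(1 - p)/n + (n - 1)/n * sum of q_rho (p - p_rho)^2."""
    mean = sum(a * b for a, b in zip(q, p))
    limit = sum(a * (mean - b) ** 2 for a, b in zip(q, p))
    return mean * (1 - mean) / n + Fraction(n - 1, n) * limit


q = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]   # illustrative q_rho
p = [Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)]  # illustrative p_rho
p_mean = sum(a * b for a, b in zip(q, p))
limit_sq = sum(a * (p_mean - b) ** 2 for a, b in zip(q, p))  # formula (5)

for n in (2, 10, 1000):
    value = dispersion_sq_by_average(q, p, n)
    normal = p_mean * (1 - p_mean) / n        # normal dispersion, squared
    print(n, float(value), float(limit_sq), float(normal))
```

For each n the printed value lies above both the limiting value and the normal dispersion, and it falls toward the limiting value as n grows.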

§ 62. The Inductive Inference 

In respect to horizontally inhomogeneous lattices there arises a question that
has no analogue for homogeneous lattices. Given a finite initial section of a
horizontal sequence (assumed as standing in the k-th row) for which the
frequency $F^n(B^{ki}, C^{ki})^i$ has the value $f = \frac{m}{n}$, what is the probability that the
sequence is controlled by a probability $p = f \pm \delta$, or in the frequency interpretation,
that the sequence, upon prolongation, will converge toward a limit
of the frequency within the interval $f \pm \delta$? Since the question concerns an
inference from a given finite initial section to the infinite remainder of the
sequence, it refers to an inductive inference (see § 17).

The question cannot be answered unless certain specializing conditions are
introduced. The range of possible values for p, which goes from 0 to 1, may
be divided into small intervals of the length dp. A sequence of a probability
that lies within the interval from p to p + dp will be called a sequence of
the kind $B_{p,dp}$, this symbol taking over the function of the symbol $B_\rho$ previously
used. If the term $B_{p,dp}$ denotes the attribute class, the formula computed
for a precise value p is a probability density and requires multiplication
by dp to become a probability. If the term stands for the reference class, the
formula computed for a precise value p is a probability and must not be multiplied
by dp; a relative probability function can have a precise value in the
reference class (see § 45). We thus have, instead of (4', § 60),

$$ P(A . B_{p,dp}^{ki}, C^{ki})^i = p \tag{1} $$
Instead of a set of probabilities of the second level $q_\rho$, we then have a continuous
function q(p) such that

$$ P(A, B_{p,dp}^{kn})^k = q(p)\,dp \tag{2} $$

Assuming that these probabilities exist, we take over condition a, formulated
in (5, § 60), in the form

$$ \int_0^1 q(p)\,dp = 1 \tag{3} $$

Instead of condition b, § 60, we introduce the stronger condition:

c. The sublattice of the C-sequences of the kind $B_{p,dp}$ is a lattice of normal
sequences in the narrower sense.

On this assumption, the probability of obtaining an initial section of the
frequency $f = \frac{m}{n}$, if the horizontal probability is $= p$, is given by

$$ P(A . B_{p,dp}^{kn}, F_m^{kn})^k = w_{nm} = \frac{1}{n+1}\, w_n(p;f), \qquad f = \frac{m}{n} \tag{4} $$

where the symbol $F_m^{kn}$ has the meaning defined in (10a, § 50) and the function
$w_n(p;f)$, given by (13, § 57), has Bernoulli properties.

The probability sought for is derivable from (2) and (4) by means of the
continuous form of the rule of Bayes (8, § 45) and is to be written

$$ P(A . F_m^{kn}, B_{p,dp}^{kn})^k = v_n(f;p)\,dp = \frac{q(p)\, w_n(p;f)\,dp}{\int_0^1 q(p)\, w_n(p;f)\,dp} \tag{5} $$

The function q(p) takes over the role of the antecedent probabilities. The
factor $\frac{1}{n+1}$ drops out.

The expression (5) may first be discussed on the simplifying assumption
that q(p) = const., that is, the antecedent probabilities are equal to one
another; the term q(p) then drops out. According to (14, § 57), the denominator
is then = 1; thus we arrive at the result

$$ v_n(f;p) = w_n(p;f) \tag{6} $$

This relation may be stated in words as follows. The Bernoulli function
$w_n(p;f)$ has the dual meaning of the probability density that the frequency
is $= f$ if the probability controlling the sequence is $= p$, and of the probability
density that the probability controlling the sequence is $= p$ if the
frequency is $= f$. The latter meaning is contingent upon equality of the
antecedent probabilities and restricted to normal sequences.
The significance of the property (14, § 57) can now be stated. The forward
probability that a certain frequency $f = \frac{m}{n}$ will result, computed as long as
the value p is unknown, is given by

$$ P(A, F_m^{kn})^k = \int_0^1 q(p)\, w_{nm}\,dp = \frac{1}{n+1} \int_0^1 q(p)\, w_n(p;f)\,dp \tag{7} $$

If q(p) = const., (3) gives q(p) = 1; because of (14, § 57) the probability (7)
is then $= \frac{1}{n+1}$. This means: as long as the value p is unknown, every one of
the n + 1 possible values of the frequency $f = \frac{m}{n}$ is to be expected with the
same probability $\frac{1}{n+1}$. The condition (14, § 57) thus states that, if the antecedent
probabilities are equal, the forward probabilities of the possible frequencies
are equal to one another.
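The equality of the forward probabilities can be verified exactly for a small n. The Python sketch below assumes that $w_{nm}$ is the binomial expression $\binom{n}{m} p^m (1-p)^{n-m}$, which agrees with the value $(n+1)p^n$ used for $w_n(p;f)$ in the case m = n, and integrates it over p with q(p) = 1. The function names and the choice n = 12 are illustrative.

```python
from fractions import Fraction
from math import comb, factorial


def beta_integral(a, b):
    """Exact integral of p^a (1 - p)^b over [0, 1]: a! b! / (a + b + 1)!."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))


def forward_probability(n, m):
    """Formula (7) with q(p) = 1: integral of C(n, m) p^m (1 - p)^(n - m) dp."""
    return comb(n, m) * beta_integral(m, n - m)


n = 12
forward = [forward_probability(n, m) for m in range(n + 1)]
print(forward)
```

Each of the n + 1 entries equals 1/(n + 1), so the entries sum to 1.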

The probability of a value p between $p_1$ and $p_2$ is given by

$$ v_n(f;p_1,p_2) = \int_{p_1}^{p_2} v_n(f;p)\,dp \tag{8} $$

Since the function $w_n(p;f)$ satisfies the convergence relations (16, § 57), we
derive from (6) and (8) the same convergence relations for $v_n(f;p)$:

$$ \lim_{n \to \infty} v_n(f;p_1,p_2) = 1 \quad \text{for } p_1 \leq f \leq p_2 $$

$$ \lim_{n \to \infty} v_n(f;p_1,p_2) = 0 \quad \text{for } f < p_1 \text{ or } f > p_2 \tag{9} $$


These convergence relations, however, are not restricted to the special
case q(p) = const., but are demonstrable for the general case (5). For this
purpose we integrate (5) according to (8):

$$ v_n(f;p_1,p_2) = \frac{\int_{p_1}^{p_2} q(p)\, w_n(p;f)\,dp}{\int_0^1 q(p)\, w_n(p;f)\,dp} \tag{10} $$

The denominator may be divided into three integrals, for the case $p_1 \leq f \leq p_2$,

$$ v_n(f;p_1,p_2) = \frac{\int_{p_1}^{p_2} q(p)\, w_n(p;f)\,dp}{\int_0^{p_1} q(p)\, w_n(p;f)\,dp + \int_{p_1}^{p_2} q(p)\, w_n(p;f)\,dp + \int_{p_2}^{1} q(p)\, w_n(p;f)\,dp} \tag{11} $$
Because of its Bernoulli character, the function $w_n(p;f)$ has a maximum
$w_{n,max}$ in the interval from $p = 0$ to $p = p_1$ and from $p = p_2$ to $p = 1$;
$w_{n,max}$ goes to 0 with increasing n. Replacing the first and the third integral
of the denominator by integrals of the form $w_{n,max} \cdot \int q(p)\,dp$, we show that
these expressions converge to 0 with increasing n, since (3) holds. The integral
from $p_1$ to $p_2$ does not converge to 0 if the function q(p) does not vanish for
$p = f$. The expression (11) converges then to 1 with increasing n. On the
condition $q(p) \neq 0$ for $p = f$, the first of the relations (9) is thus derived.
The second follows because $v_n(f;0,1) = 1$.

The proof that the convergence relations (9) hold for every form of the
function q(p), and are thus independent of the values of the antecedent
probabilities, was first given by Laplace. His proof was constructed for normal
sequences. Poisson extended the proof to the sequence type that carries his
name. Since the proof does not refer to the particular form of the function
$w_n$, but uses only its convergence properties, it can be extended to the general
type of Bernoulli sequence defined in § 57. The function $v_n(f;p)$ is then constructed
by substituting the function $w_n(p, s_1 \ldots s_r; f)$ in (5) and integrating
over the $s_1 \ldots s_r$ with respect to their total range. For the proof of relations
(9), the convergence properties (10, § 57) of the function $w_n(p;f)$ are used.
This proof, which can easily be given by analogy with the discussion of (11),
will not be presented in this book.

The general convergence theorem (9) is of great importance for the theory
of induction by enumeration, or a posteriori determination of a probability,
which states that the frequency observed for a finite initial section can be
identified with the limit of the frequency, that is, with the probability controlling
the sequence. The use of formula (9) presupposes only that the probabilities
p and q(p) exist, that q(p) does not vanish for $p = f$, and that the
sequences form a Bernoulli lattice. On these conditions we conclude from (10)
and (9), putting $p_1 = f - \delta$ and $p_2 = f + \delta$:

Given a finite initial section of a probability sequence of the length n and
the frequency f, there exists a probability $v_n$ that the observed frequency f
represents the probability p controlling the total sequence within an interval
of exactness $\pm \delta$. It is true that we cannot calculate $v_n$ if we do not know the
function q(p); and we do not know whether $v_n$ is the maximum probability
or whether the maximum belongs to a different value p. But we infer from
(9): the larger n, the larger is the probability $v_n$ that the observed frequency
represents the probability p within the interval of exactness $\pm \delta$; and $v_n$ goes
toward 1 with increasing n.
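The behavior of $v_n$ for a non-constant q(p) can be illustrated numerically. The Python sketch below evaluates the ratio of integrals in (10) by the midpoint rule for the fixed frequency f = 0.7 and $\delta$ = 0.05. The antecedent density $3(1-p)^2$, the sample sizes, the step count, and the function names are assumptions chosen for the illustration, and $w_n(p;f)$ is taken to be $(n+1)\binom{n}{m}p^m(1-p)^{n-m}$.

```python
from math import exp, lgamma, log


def log_w(n, m, p):
    """Logarithm of w_n(p; f) = (n + 1) C(n, m) p^m (1 - p)^(n - m), f = m/n."""
    log_binom = lgamma(n + 1) - lgamma(m + 1) - lgamma(n - m + 1)
    return log(n + 1) + log_binom + m * log(p) + (n - m) * log(1 - p)


def inverse_probability(q, n, m, p1, p2, steps=20_000):
    """Formula (10) by the midpoint rule: probability that p lies in [p1, p2]."""
    h = 1.0 / steps
    total = inside = 0.0
    for j in range(steps):
        p = (j + 0.5) * h                  # midpoints avoid log(0) at the ends
        weight = q(p) * exp(log_w(n, m, p))
        total += weight
        if p1 <= p <= p2:
            inside += weight
    return inside / total


def skewed_prior(p):
    """A hypothetical non-constant antecedent density, positive at p = f."""
    return 3.0 * (1.0 - p) ** 2


delta = 0.05
masses = [
    inverse_probability(skewed_prior, n, (7 * n) // 10, 0.7 - delta, 0.7 + delta)
    for n in (10, 100, 1000, 10000)
]
print(masses)
```

Although this density favors small values of p, the computed values rise toward 1 as n increases, in accordance with the first relation of (9).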

The convergence theorem by itself does not tell us what number n is large
enough to make the inverse probability $v_n$, formulated in (10), higher than a
given value. But such information can be derived when at least a lower bound
$q^* > 0$ is known for the value of the antecedent probability density q(p) in
the interval from $f - \delta$ to $f + \delta$. On this condition a value $v_n^*$ can be constructed
such that

$$ v_n(f; f - \delta, f + \delta) \geq v_n^*, \qquad \lim_{n \to \infty} v_n^* = 1 \tag{12} $$


The value $v_n^*$, which does not require a knowledge of the function q(p), can be
found as follows. We have

$$ \int_{f-\delta}^{f+\delta} q(p)\,dp \geq q^* \cdot 2\delta \tag{13} $$

and because of (3)

$$ \int_0^{f-\delta} q(p)\,dp + \int_{f+\delta}^{1} q(p)\,dp \leq 1 - q^* \cdot 2\delta \tag{14} $$


As before, the largest value of $w_n(p;f)$ in the interval from 0 to $f - \delta$ and from
$f + \delta$ to 1 may be called $w_{n,max}$. Replacing in the first and the third integral
of the denominator of (11) the function $w_n(p;f)$ by $w_{n,max}$ makes the whole
expression smaller, or, at least, not larger. Dividing numerator and denominator
by the numerator and then replacing q(p) by $q^*$ in the integral extending from
$f - \delta$ to $f + \delta$ has the same effect. By the use of the inequality (14) one thus
obtains a value

$$ v_n^* = \frac{1}{1 + \dfrac{(1 - q^* \cdot 2\delta)\, w_{n,max}}{q^* \int_{f-\delta}^{f+\delta} w_n(p;f)\,dp}} \tag{15} $$

which has the properties required in (12), because $w_{n,max}$ converges to 0 and
the integral converges to 1.
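A numerical comparison of the bound with the exact value may make the construction concrete. The Python sketch below assumes that (15) has the form $v_n^* = 1 / \big(1 + (1 - q^* 2\delta)\, w_{n,max} / (q^* \int_{f-\delta}^{f+\delta} w_n\,dp)\big)$, as obtained from (11), (13) and (14). It uses the hypothetical density $q(p) = 3(1-p)^2$ with f = 0.7 and $\delta$ = 0.05, for which $q^* = 0.1875$ is a lower bound on the interval. All names and numerical choices are illustrative.

```python
from math import exp, lgamma, log


def w(n, m, p):
    """Bernoulli function w_n(p; f) = (n + 1) C(n, m) p^m (1 - p)^(n - m)."""
    log_binom = lgamma(n + 1) - lgamma(m + 1) - lgamma(n - m + 1)
    return exp(log(n + 1) + log_binom + m * log(p) + (n - m) * log(1 - p))


def integrate(g, a, b, steps=20_000):
    """Midpoint rule for the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (j + 0.5) * h) for j in range(steps))


def lower_bound(n, m, delta, q_star):
    """Assumed form of (15): needs only q* and the Bernoulli function."""
    f = m / n
    # w_n(p; f) has its single peak at p = f, so outside f +/- delta the
    # largest value is attained at one of the two ends of the interval.
    w_max = max(w(n, m, f - delta), w(n, m, f + delta))
    inner = integrate(lambda p: w(n, m, p), f - delta, f + delta)
    return 1.0 / (1.0 + (1.0 - 2.0 * delta * q_star) * w_max / (q_star * inner))


def exact_value(q, n, m, delta):
    """Formula (10) for the interval f +/- delta, with a known density q."""
    f = m / n
    num = integrate(lambda p: q(p) * w(n, m, p), f - delta, f + delta)
    den = integrate(lambda p: q(p) * w(n, m, p), 0.0, 1.0)
    return num / den


def q(p):
    return 3.0 * (1.0 - p) ** 2           # hypothetical antecedent density


delta, q_star = 0.05, 3.0 * 0.25 ** 2     # q(p) >= q* = 0.1875 on [0.65, 0.75]
sizes = (100, 1000, 10000)
bounds = [lower_bound(n, (7 * n) // 10, delta, q_star) for n in sizes]
exact = [exact_value(q, n, (7 * n) // 10, delta) for n in sizes]
print(list(zip(sizes, bounds, exact)))
```

The bound is weak for small n and tightens as n grows; it never exceeds the exact value, which is the content of the inequality in (12).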

Some other formulas may be presented that have been derived within the
theory of the inductive inference and that concern, not the probability of
elements in the whole infinite sequence, but that of the frequency of elements
in a finite section immediately following the given initial section, or consecutive
section. In spite of the finiteness of the section, the problem concerns a
probability of a higher level, because the horizontal sequences may have
different probabilities p and the lattice is thus inhomogeneous. Since different
probabilities p can produce the same consecutive section, a mean in terms of
the possible values p must be computed.

For the computation, the rule of composition (20, § 22) is used, which
for lattice counting can be written

$$ P(A . D^{ki}, E^{ki})^k = \sum_{\rho=1}^{r} P(A . D^{ki}, B_\rho^{ki})^k \cdot P(A . D^{ki} . B_\rho^{ki}, E^{ki})^k \tag{16} $$
For $B_\rho$, which in the problem under discussion refers to the sequence kind,
we put $B_{p,dp}$. The first interval dp may have $p_1$ as its mean value; the last, $p_r$.
The summation then extends from $p_1$ to $p_r$. For $D^{ki}$, which refers to the observed
initial section of the sequence, we put $F_m^{kn}$, replacing i by n also in the
other terms.

The computation will be carried through on two specializing assumptions.
The first is that the sublattice of the sequences $B_{p,dp}$ is a lattice of normal
sequences in the narrower sense. This includes the assumptions of lattice
invariance and of independence of predecessors, so that the last probability
can be transformed as follows:

$$ P(A . F_m^{kn} . B_{p,dp}^{kn}, E^{kn})^k = P(A . B_{p,dp}^{kn}, E^{kn})^k = P(A . B_{p,dp}^{ki}, E^{ki})^k = P(A . B_{p,dp}^{ki}, E^{ki})^i \tag{17} $$

The introduction of i in the third probability expression is permissible because
this probability, for normal sequences, does not depend on the variable
n or i. The transition to the last form follows from lattice invariance. On the
insertion of (17), formula (16) assumes the form

$$ P(A . F_m^{kn}, E^{kn})^k = \sum P(A . F_m^{kn}, B_{p,dp}^{kn})^k \cdot P(A . B_{p,dp}^{ki}, E^{ki})^i \tag{18} $$

Calling the probability on the left s, using the notation (5) for the second
expression, and putting r(p) for the last probability, we can write (18) in
mathematical notation

$$ s = \int_0^1 v_n(f;p)\, r(p)\,dp \tag{19} $$

The second specializing assumption is that q(p) = const., that is, the antecedent
probabilities are equal. This assumption allows for the use of the value
(6) for $v_n$, and thus (19) can be written

$$ s = \int_0^1 w_n(p;f)\, r(p)\,dp \tag{20} $$

This form will be used to answer several questions. 

If the given initial section consists of n elements C, whereas no elements $\bar{C}$
occur, what is the probability $s_n$ that the next element will be a C?

For the answer, using (13, § 57) for m = n, we put

$$ s = s_n \qquad w_n(p;f) = (n + 1)\,p^n \qquad r(p) = p \tag{21} $$

Formula (20) then gives

$$ s_n = (n + 1) \int_0^1 p^{n+1}\,dp = \frac{n + 1}{n + 2} \tag{22} $$





This formula is called the rule of succession.¹ It determines the probability $s_n$
that an initial section of n elements, all of which are C's, is followed by another
element C, if the sequence is normal and the antecedent probabilities
for all values p are equal. The probability $s_n$ refers to a frequency in the
vertical direction: among the horizontal sequences that begin with n elements
C, there is a fraction, determined by $s_n$, in which the n + 1-st element is a C. The
probability $s_n$ cannot be interpreted as a horizontal frequency because the
first probability on the right of (18) cannot be written for horizontal counting;
(3'', § 60) shows that such counting would lead to a different value.

The second question concerns the probability $s_{n,n'}$ of obtaining n' elements C
after an initial section of n elements C. Leaving the function $w_n$ of (21) unchanged,
for the answer we put

$$ s = s_{n,n'} \qquad w_n(p;f) = (n + 1)\,p^n \qquad r(p) = p^{n'} \tag{23} $$

Formula (20) thus gives

$$ s_{n,n'} = (n + 1) \int_0^1 p^{n+n'}\,dp = \frac{n + 1}{n + n' + 1} \tag{24} $$

Whereas the value (22) converges to 1 with increasing n, the value (24) does
so only if n' is kept constant. If n is constant and n' increases, (24) goes
toward 0. The inductive inference based on the observation of n elements C
cannot secure a limit of the frequency that is strictly = 1, but only a limit
close to 1 within a small interval of exactness $\delta$.

The third question concerns the probability $s_{f,f'}$ of obtaining a consecutive
section of the frequency $f' = \frac{m'}{n'}$, if an initial section of the frequency
$f = \frac{m}{n}$ is observed. The question requires the substitutions

$$ s = s_{f,f'} \qquad w_n(p;f) = (n + 1)\binom{n}{m} p^m (1 - p)^{n-m} \qquad r(p) = \binom{n'}{m'} p^{m'} (1 - p)^{n'-m'} \tag{25} $$


¹ Since the formula is widely used in the literature, a simple derivation may be added
that does not make use of (6). The probability that, if the probability of the sequence is
p, a run of n elements C occurs, is $= p^n$; therefore the inverse probability of a value p,
for equal antecedent probabilities, is given by (12, § 21) as

$$ v_n(f;p)\,dp = \frac{p^n\,dp}{\int_0^1 p^n\,dp} = (n + 1)\,p^n\,dp \qquad \text{for } f = 1 $$

The probability of obtaining one more element C, if p is the probability controlling the
sequence, is $= p$; the rule of elimination (21, § 19) thus gives

$$ s_n = (n + 1) \int_0^1 p^{n+1}\,dp = \frac{n + 1}{n + 2} $$
$$ s_{f,f'} = (n + 1) \binom{n}{m} \binom{n'}{m'} \int_0^1 p^{m+m'} (1 - p)^{n-m+n'-m'}\,dp = \frac{(n + 1)!\; n'!\; (m + m')!\; (n - m + n' - m')!}{m!\,(n - m)!\,m'!\,(n' - m')!\,(n + n' + 1)!} \tag{26} $$


For the computation of the integral, the auxiliary formula (15, § 57) has been
used. Approximate values of $s_{f,f'}$ can be found with the help of Stirling’s
formula (13, § 49). For the values m = n, m' = n' = 1, (26) is transformed
into (22). For m = n, m' = n', (26) is transformed into (24). For m' = n' = 1,
(26) assumes the form

$$ s_{f,1} = \frac{m + 1}{n + 2} \tag{27} $$

This value means the probability of obtaining an element C after the observation
of a section of m elements C and (n − m) elements $\bar{C}$.
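The special cases can be checked with exact arithmetic. The Python sketch below evaluates the factorial expression $\frac{(n+1)!\,n'!\,(m+m')!\,(n-m+n'-m')!}{m!\,(n-m)!\,m'!\,(n'-m')!\,(n+n'+1)!}$ for a few arguments; the function name, the argument names `n2` and `m2` (standing for n' and m'), and the sample values are illustrative assumptions.

```python
from fractions import Fraction
from math import factorial


def consecutive_section_probability(n, m, n2, m2):
    """Probability of m2 C's among the next n2 elements, after m C's in n.

    Holds only under the assumptions of the text: normal sequences and
    equal antecedent probabilities, q(p) = const.
    """
    num = (factorial(n + 1) * factorial(n2) * factorial(m + m2)
           * factorial(n - m + n2 - m2))
    den = (factorial(m) * factorial(n - m) * factorial(m2)
           * factorial(n2 - m2) * factorial(n + n2 + 1))
    return Fraction(num, den)


n = 9
succession = consecutive_section_probability(n, n, 1, 1)    # case of (22)
run_of_three = consecutive_section_probability(n, n, 3, 3)  # case of (24)
after_mixed = consecutive_section_probability(n, 6, 1, 1)   # case of (27)

# For fixed n, m and n2, the values for m2 = 0 ... n2 exhaust all possible
# consecutive sections and therefore sum to 1.
total = sum(consecutive_section_probability(n, 6, 4, m2) for m2 in range(5))
print(succession, run_of_three, after_mixed, total)
```

With n = 9 the three special cases give (n + 1)/(n + 2), (n + 1)/(n + n' + 1) with n' = 3, and (m + 1)/(n + 2) with m = 6.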

In philosophical discussions the simple formulas (22), (24), (26) and (27)
have often been regarded as supplying a justification of the inductive inference.
Such a conception is incorrect, since the formulas do not possess a general
validity; they hold only for normal sequences and even for these only
on the condition q(p) = const., that is, on the condition of equal antecedent
probabilities. This condition restricts the applicability of the formulas to
cases in which some knowledge of the antecedent probabilities has been
acquired in other ways, that is, without the use of the formulas. (See the discussion
of the rule of succession in §§ 72 and 86.)

Formula (9) is of greater significance, since it is based on weaker assump¬ 
tions; it presupposes only the Bernoulli character of the sequences, and holds 
for any values of the antecedent probabilities. Even this formula, however, 
cannot be regarded as supplying a general justification of the inductive infer¬ 
ence, because it, too, is based on special presuppositions. Its significance lies 
rather in the function that it performs in the further extension of the theory 
of induction, once the general justification of induction has been given. 
These logical questions will be discussed again in §§86 and 88-90. 
THE PROBLEM OF APPLICATION 

§ 63. The Problem 

The exposition of this book has now reached a point at which the purely 
logico-mathematical line of thought is discontinued. The investigation turns 
into a new direction: its objective is the analysis of the relations between the 
mathematical calculus of probability and knowledge of nature, that is, the 
application of the calculus of probability to reality. The complex of these 
questions is called the problem of application. 1 In the course of this
presentation, numerous examples for the application of the calculus of probability
have been given, but without reflection on the legitimacy of the application, 
which will now be analyzed in detail. 

The problem of application may be divided into two groups of questions. 
The first group includes the question what is meant by the concept of
probability as used in applications; we speak here of the problem of the meaning
of probability statements. The second group deals with the question whether 
we are justified in applying the rules of probability to physical objects, whether 
we are entitled to assert probability statements referring to physical objects 
when the statements are established by means of the rules of the calculus. 
This group may be called the problem of the assertability of probability state¬ 
ments. The two problems are closely connected, since the assertability will 
depend on the meaning assumed for the probability statement. I speak of 
assertability instead of truth because the requirement of truth would be too 
strong. It will be seen that a certain group of probability statements—those, 
namely, that state numerical probability values—cannot be proved as true, 
but are assertable on other grounds. 

The problem of meaning can be formulated as the problem of the inter¬ 
pretation of the P-symbol. It was explained above that in the formal concep¬ 
tion of the calculus of probability the P-symbol remains uninterpreted; the 
axioms then are conceived as a set of implicit definitions restricting the mean¬ 
ing of the P-symbol by subjecting it to certain formal conditions, without, 
however, defining the meaning exhaustively. Although the axioms exclude a 
class of interpretations as inadmissible, they leave open a class of admissible 
interpretations. The problem of meaning consists in selecting, among the ad¬ 
missible interpretations, the one that is suitable to cover the applications of 
the P-symbol to physical reality. 

There remains, of course, the possibility that several interpretations are 
used, varying with the subject to which the term “probability” is applied. 

1 The term was introduced by Edgar Zilsel in Das Anwendungsproblem (Leipzig, 1916). 
In fact, the term has two different forms of usage, which call for separate 
treatment. In the first usage the word “probability” refers to sequences of 
events or of other physical objects; in this usage the word must, without ques¬ 
tion, be interpreted as referring to a relative frequency, though the precise 
formulation of the interpretation requires further investigation. In the second 
usage, however, the term “probability” is applied to single events or other 
single physical objects; we must inquire whether this usage necessitates the 
introduction of a genuinely different interpretation or whether it is reducible 
to the frequency interpretation. Postponing the discussion to later sections 
(§§ 71-72), we shall turn first to an analysis of the frequency interpretation. 

§ 64. The Logical Form of Limit Statements 

According to the frequency interpretation, introduced in § 16, the probability 
of a sequence is defined by the limit of the relative frequency. In order to 
make the meaning of the limit statement clear, I shall formulate it by means 
of the logical symbolism. 

The limit statement may be formulated in two steps, which will be symbolized
in the notation for the frequency sequence $f^n$ counted through (see 1,
§ 48). On the first step we formulate the condition that, throughout the
sequence, terms $f^n$ that are as close to the limit p as we wish will always
recur. In symbolic notation (see 3, § 89),

$$(\delta)(n_0)(\exists n)\,(n > n_0)\,.\,(f^n = p \pm \delta) \tag{1}$$

Translated into words this means: however small we choose δ and however
large we choose $n_0$, there is an element $f^n$ beyond $f^{n_0}$ that is situated within
p ± δ. In this case I shall call p a partial limit of the sequence. 

On the second step we set up a stronger condition, adding the requirement:
for every δ, however small, there is an $n_0$ such that, from $f^{n_0}$ on, all $f^n$ remain
within p ± δ. In symbolic notation this reads

$$(\delta)(\exists n_0)(n)\,[(n > n_0) \supset (f^n = p \pm \delta)] \tag{2}$$

Only if (2) is satisfied do we call p the limit of the sequence. The limit is dis¬ 
tinguished from the partial limit by the requirement of a certain permanence 
in the approximations attained. 

The following relations hold for the assertions (1) and (2). Statement (2) 
is the stronger assertion: (1) follows from (2), as can be derived from (16a, 
§ 6). However, (1) can be satisfied without (2) being satisfied. The latter case 
will be exemplified later. 

The relation of the limit to the partial limit can be expressed as follows: 
a sequence that has a partial limit in p contains a subsequence that has a 
limit in p. 
Furthermore, it can be shown that an assertion of the form (2) can be
obtained by the negation of an assertion of the form (1), and vice versa.
According to (13a and 13b, § 6) and (6b, § 1) we have

$$\overline{(\delta)(n_0)(\exists n)\,(n > n_0)\,.\,(f^n = p \pm \delta)} \;\equiv\; (\exists\delta)(\exists n_0)(n)\,[(n > n_0) \supset \overline{(f^n = p \pm \delta)}] \tag{3}$$

The left side negates the existence of a partial limit; the right side differs
from (2) only by the term (∃δ) and by the negation of the expression in
parentheses concerning $f^n$. In all other respects it is of the same type as (2);
in particular, it asserts the permanence of a certain property beginning with
an element $f^{n_0}$.
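
The equivalence can be spelled out step by step. The following sketch uses the overbar for negation and the abbreviation $\Phi$ for the expression $f^n = p \pm \delta$; the abbreviation is introduced here only for brevity and is not part of the text's notation. Each line applies either the exchange of an all-operator and an existential operator under negation, or the rule that the negation of a conjunction "A and Φ" is the implication "A implies not-Φ".

```latex
\begin{align*}
\overline{(\delta)(n_0)(\exists n)\,[(n > n_0)\,.\,\Phi]}
  &\equiv (\exists\delta)\,\overline{(n_0)(\exists n)\,[(n > n_0)\,.\,\Phi]} \\
  &\equiv (\exists\delta)(\exists n_0)\,\overline{(\exists n)\,[(n > n_0)\,.\,\Phi]} \\
  &\equiv (\exists\delta)(\exists n_0)(n)\,\overline{(n > n_0)\,.\,\Phi} \\
  &\equiv (\exists\delta)(\exists n_0)(n)\,[(n > n_0) \supset \overline{\Phi}]
\end{align*}
```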

The assertion of the left side of (3) may be extended to a totality of values q:

$$(q)\,[(q \neq p) \supset \overline{(\delta)(n_0)(\exists n)\,(n > n_0)\,.\,(f^n = q \pm \delta)}] \tag{4a}$$

The expression states that all q different from p are not partial limits. Because
of (3), (4a) is equivalent to the statement

$$(q)\,\{(q \neq p) \supset (\exists\delta)(\exists n_0)(n)\,[(n > n_0) \supset \overline{(f^n = q \pm \delta)}]\} \tag{4b}$$

In the language of the theory of sets, (1) is equivalent to the assertion
that p is an accumulation point of the set $f^n$; correspondingly, the left side
of (3) is equivalent to the assertion that p is not an accumulation point of
the set $f^n$. According to a theorem of the theory of sets, an infinite set situated
between finite limits must have at least one accumulation point. Since for
infinite sequences the $f^n$ represent an infinite set between the limits 0 and 1,
p must be an accumulation point if (4a) or (4b) holds, that is, (1) must hold
for p. Other accumulation points are not possible because of (4a) and (4b).
But then (2) also must be valid for p, that is, p must be a limit. If (2) were
not satisfied for p, an infinite number of elements $f^n$ would exist outside a
certain interval p ± δ; these would have an accumulation point, a result
that contradicts (4a) and (4b). For infinite sequences, therefore, (2) follows
from (4a) or (4b). It is easy to show that the converse relation holds also.
Consequently, (2) is equivalent to (4a) and to (4b). For infinite sequences
the assertion of the existence of a limit p can therefore be represented by
the negation of an assertion about the existence of a partial limit extended
to all q different from p. 

I turn now to the problem of how to find out whether a sequence has a 
limit according to (2), and which value of p constitutes this limit. This 
analysis requires a study of the form in which the sequence is given. 

In principle, a class or a sequence of elements can be given in two ways. 
First, the elements can be pointed out or enumerated individually, as in a 
list of students attending a certain class. A class or sequence given in this 
way is said to be extensionally given. Second, the class or sequence can be
defined by a rule that does not enumerate the elements individually. Of this
sort, for instance, is the class definition, “all men whose height is between
5½ and 6 feet”. The phrase states a rule by means of which it can be
determined for every individual object whether it belongs to the class defined.
Such a class is said to be intensionally given. The difference in the way of
giving a class or sequence becomes important when the operators “all” or
“there is” are applied. 

As long as we deal with finite classes the application of the operators has 
a simple meaning; difficulties result, however, in the application of the oper¬ 
ators to infinite classes or sequences. For all-statements or existential state¬ 
ments referring to infinite classes, the problem of verifiability arises—the 
question how such statements can be evaluated as true or false. With regard 
to finite classes, the verification can be achieved, in principle, by checking 
individually every element of the class. But the procedure is not applicable 
to infinite classes. 

In investigating the consequences resulting from this fact, we shall consider 
first the intensional way of giving. With respect to classes given in this way, 
the concepts “all” and “there is” can be applied as adequately as for finite 
classes if we confine ourselves to statements that can be tested in terms of 
the rule of giving. If we say, for instance, “For all rational numbers it is true 
that the quotient of two of them is also a rational number”, we may be sure 
of the correctness of this proposition without testing it for every rational 
number. It simply follows from the rule of giving that is used for this class, 
that is, from the definition of a rational number. Similarly, we need not have 
qualms in asserting the statement, “There is a rational fraction greater than
$10^{10}$”, because a rule of constructing such a fraction can be derived from the
definition of rational numbers. Statements of type (1) or (2), therefore, can
be verified for infinite sequences if the sequences are intensionally given. 

Two examples may illustrate this result. Consider the sequence

$$a_n = \tfrac{1}{2} + (-1)^n \left[\tfrac{1}{4} + \tfrac{1}{2^n}\right] \tag{5}$$

The sequence $a_n$ satisfies statement (1) for the value p = ¾. But it does not
satisfy statement (2), since elements whose distance from the value ¾ is
greater than δ will always recur. This sequence has two partial limits, one at
¾ and one at ¼.

The sequence

$$b_n = \tfrac{3}{4} + (-1)^n\,\tfrac{1}{2^n} \tag{6}$$

satisfies not only statement (1) but also statement (2) for p = ¾. This can
be verified without difficulty, since statements (1) and (2) can be derived
from the rule of construction used for the sequence. 
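
The difference between a partial limit and a limit can be made concrete on two intensionally given sequences. The sketch below is in Python and rests on stated assumptions: the sequences are taken to have the form a_n = 1/2 + (-1)^n (1/4 + 2^(-n)) and b_n = 3/4 + (-1)^n 2^(-n), where the exact decaying term is an assumption, and the helper names `recurs_after` and `stays_after` are hypothetical. A finite horizon only samples the conditions; the actual verification comes from the rule of construction, not from the loop.

```python
from fractions import Fraction


def a(n: int) -> Fraction:
    # Assumed form: oscillates between neighbourhoods of 3/4 and 1/4.
    return Fraction(1, 2) + (-1) ** n * (Fraction(1, 4) + Fraction(1, 2 ** n))


def b(n: int) -> Fraction:
    # Assumed form: an oscillation around 3/4 that dies out.
    return Fraction(3, 4) + (-1) ** n * Fraction(1, 2 ** n)


def within(x: Fraction, p: Fraction, delta: Fraction) -> bool:
    """True if x lies within p ± delta."""
    return abs(x - p) <= delta


def recurs_after(seq, p, delta, n0, horizon) -> bool:
    """Finite stand-in for the partial-limit condition: some n with
    n0 < n <= horizon has seq(n) within p ± delta."""
    return any(within(seq(n), p, delta) for n in range(n0 + 1, horizon + 1))


def stays_after(seq, p, delta, n0, horizon) -> bool:
    """Finite stand-in for the limit condition: every n with
    n0 < n <= horizon has seq(n) within p ± delta."""
    return all(within(seq(n), p, delta) for n in range(n0 + 1, horizon + 1))


p, delta = Fraction(3, 4), Fraction(1, 100)
print(recurs_after(a, p, delta, 20, 200))  # True: 3/4 is a partial limit of a
print(stays_after(a, p, delta, 20, 200))   # False: odd terms lie near 1/4
print(stays_after(b, p, delta, 20, 200))   # True: b stays near 3/4
```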
In regard to extensionally given classes or sequences, it is not necessary 
to assume that the extensional way of giving is possible only for finite classes. 
If we measure the temperature of a certain spot on the earth every day, we 
shall obtain an extensionally given sequence that in principle is not limited. 
This is not only a “practically infinite” sequence. If we assume that time is 
infinite, which is at least physically possible, the sequence can be continued 
even after the individual existence of this planet has come to an end. To 
construct a suitable definition we can define the sequence in terms of the 
temperature of a certain point in the universe, using light waves as time 
units. Infinite sequences, in the sense of sequences that can be continued end¬ 
lessly, can be given extensionally. In another respect, it is true, the manner 
of giving is intensional: a rule is given by means of which for every n the 
corresponding element of the sequence can be constructed. However, the way 
of giving does not determine what sort of element will occur, that is, what 
the attribute of the element will be. The attribute is ascertainable only by 
the observation of the respective element; it is extensionally given. If someone 
asks, for instance, for the temperature on the 100th day of the series of meas¬ 
urements, there is a method of determining it: on that day, place a thermome¬ 
ter on the specified spot. From this rule, however, nothing can be inferred as 
to the height of the temperature. We can say: the sequence of the elements of 
the reference class is intensionally given; the sequence of the attributes, however, 
is extensionally given. This way of giving is possible because the coordination 
of an attribute to an element is performed by means of an act of nature. 

This result can be formulated as follows: if $x_i$ and $y_i$ are elements of the
sequences, the statement $x_i \in A$ is derivable, whereas the statement $y_i \in B$ is
not derivable. The reason is that the term A is used in the definition of the
sequence (which is defined as compact in A), whereas the term B is not used. 
The word “extensional” refers, therefore, to a relation between a predicate 
of an element and the defining rule of a class, or sequence; it says that this 
relation is synthetic. 

The sequence of throws of the die is another example of this kind of sequence. 
The reference class is given intensionally, namely, by the rule, “Take a die 
and toss it on the table”, whereas the sequence of the attributes, that is, of 
the faces turning up, is extensionally given. Corresponding considerations 
hold for all sequences of practical statistics, even in dealing with “practically 
infinite” sequences—sequences that are never completely known to us. Thus, 
if we are concerned with the frequency of male births, we know from the 
definition of this sequence that each of its elements is a birth. Only by indi¬ 
vidual observation of each case, however, shall we know whether the element 
is a male or a female birth. 

If we wish to make all-statements or existential statements about infinite
sequences that are extensionally given, we encounter great difficulty. Consider
statements simpler than (1) and (2), in which only one of the two operators
“all” and “there is” occurs. Take, for instance, the positive existential
statement, “Among the temperatures occurring in the sequence, there is one
above 30°C”. Can we verify the statement? If it is true, we can, at least in
principle. In checking the series we must once come upon such a temperature;
a single case is sufficient for verification. If the statement is false, however,
we shall never be able to falsify it. We shall find no temperature above 30°C
among the finite number of observed cases, but we cannot know whether
or not such a temperature will occur later. Such a statement may be called
unilaterally verifiable. The restriction of verifiability results from the fact
that the statement asserts an unspecified existence. If we were to indicate the
day on which the temperature above 30° is to occur, the statement would be
fully verifiable. 

Inverse relations of verifiability hold for negative existential statements. 
Consider the statement, “There is no temperature above 30° among those 
occurring in the sequence”. If the statement is false we shall come upon an 
example to the contrary in checking the series, that is, we shall recognize the 
falsity of the statement. But if it is true we shall never know its truth. Thus 
the statement is also unilaterally verifiable, though the conditions of verifi¬ 
cation are reversed. 
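
The asymmetry of unilateral verifiability can be sketched as a three-valued check on a finite initial section. This is an illustrative Python sketch only: the function names, the `budget` parameter, and the temperature values are hypothetical. The point it shows is that the positive existential statement can return "true" or "undecided" but never "false", and the negative one can return "false" or "undecided" but never "true".

```python
from itertools import islice
from typing import Iterable, Optional


def check_existential(observations: Iterable[float], threshold: float,
                      budget: int) -> Optional[bool]:
    """Test 'some element exceeds threshold' on the first `budget` observations.

    Returns True once a witness is found and None (undecided) otherwise.
    It never returns False: no finite section can refute the statement.
    """
    for value in islice(observations, budget):
        if value > threshold:
            return True
    return None


def check_negative_existential(observations: Iterable[float], threshold: float,
                               budget: int) -> Optional[bool]:
    """Test 'no element exceeds threshold': one witness refutes it,
    but no finite section can confirm it."""
    found = check_existential(observations, threshold, budget)
    return False if found else None


temperatures = [21.5, 24.0, 28.3, 31.2, 26.0]  # made-up readings in °C
print(check_existential(temperatures, 30.0, budget=3))           # None
print(check_existential(temperatures, 30.0, budget=5))           # True
print(check_negative_existential(temperatures, 30.0, budget=5))  # False
```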

Corresponding considerations hold for all-statements. “All temperatures 
occurring in the sequence are below 30°”; this positive statement is verifiable 
if it is false and is not verifiable if it is true. The negative statement, “Not all 
temperatures are below 30°”, is verifiable if it is true and is not verifiable if it 
is false. The result follows from considerations similar to those above; it is 
clear also from the equivalence of all-statements and existential statements 
formulated in (13a and 13b, § 6). 

The conditions of verifiability become even more unfavorable when we 
consider more complicated statements like (1) and (2) in which both an 
existential operator and an all-operator occur. If, for the sake of simplicity, 
we neglect the all-operator binding the variable δ, statement (1) will be
presented in the example by the statement, “There will always recur temperatures
above 30°”, which is completely unverifiable. If the statement is true, 
we shall repeatedly find days where the temperature is above 30°; but we 
cannot know whether such temperatures will continue. If it is false, there 
will be a day after which we shall never find temperatures above 30°; but we 
cannot know whether or not such temperatures will be observed later. The 
impossibility of coming to a decision results from a combination of the impos¬ 
sibility of verifying the all-statement for the positive case and the impossibility 
of verifying the existential statement for the negative case. 

An example for (2) is the statement, “From a certain day on, no tempera¬ 
tures above 30° will be found”. It is easily seen that this statement, too, is 
completely unverifiable. 
The result of the analysis must now be applied to probability sequences. 
Both ways of giving, the intensional and the extensional, can be carried 
through for probability sequences. These terms will be referred only to the 
attribute class; an intensionally or extensionally given probability sequence 
is a sequence for which the attribute is given, respectively, intensionally or 
extensionally. The strictly alternating sequence (1, § 20) represents an exam¬ 
ple of an intensionally given probability sequence. Here is to be classified, 
also, Henri Poincaré’s 1 example of the last digits of a table of logarithms, 
among which odd and even numbers are equally frequent. Sequences of throws 
of the die, social statistics, and so on, are examples of extensionally given 
probability sequences. Transferring the results of the analysis of limit state¬ 
ments to probability sequences, we arrive at the following conclusion: 

With respect to intensionally given probability sequences of an infinite 
length, statements expressing a frequency interpretation have the usual 
meaning of mathematical all-statements and existential statements; like the 
latter, they are strictly verifiable. With respect to extensionally given prob¬ 
ability sequences of an infinite length, however, statements expressing a fre¬ 
quency interpretation are not verifiable. 

The first result is important for the mathematician. It proves the existence 
of an interpreted calculus of probability—a calculus of probability dealing 
with probability as a frequency, in which probability statements have the 
usual meaning of mathematical limit statements. The convergence, therefore, 
can be formulated in the usual way. For every δ the n can be calculated for 
which $f^n$ represents the place of convergence. The interpreted calculus of 
probability thus constructed is a discipline of the same type as other mathe¬ 
matical disciplines dealing with infinite sequences that are given by defining 
rules; it must be conceived as a branch of arithmetic. 2 

With regard to extensionally given probability sequences, however, the 
result is different. We saw that for them the frequency interpretation of prob¬ 
ability leads to completely unverifiable statements because the statements 
include both all-operators and existential operators. An extensionally given 
sequence can never be realized in its entirety; we know only an initial section 
of it, and its infinite remainder is a matter of the future. It is obvious, how¬ 
ever, that the relative frequency holding for the initial section imposes no 
restrictions on the limit of the relative frequency existing for the infinite 

sequence: given a relative frequency $f^n = m/n$ for the initial section, it is
possible

1 Calcul des probabilités (Paris, 1912), p. 313. 

2 This result is due to the elimination of the principle of randomness. The calculus of 
probability of von Mises cannot be treated on logically equal terms with the other
mathematical disciplines because von Mises' collective cannot be given intensionally. My normal 
sequences, on the contrary, can be so given; see my remarks on Copeland's model of normal 
sequences in § 30. 
to imagine a continuation of the sequence of such a kind that the frequency
converges toward any value p selected arbitrarily. In other words,
let m' and n' be the respective frequencies of elements counted from the place
of the n-th element; then the expression $\frac{m + m'}{n + n'}$, for m' and n' going toward
infinite values, will converge toward a limit independent of the constant 
values m and n. With respect to extensionally given sequences of an infinite 
length, therefore, the limit statement is not verifiable. 
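
That an observed initial section places no restriction on the limit can be illustrated numerically. The following Python sketch is one possible construction and not taken from the text: the function `steer` and its greedy rule (add an element of the attribute exactly when the running frequency is below the chosen target) are assumptions made for illustration. Starting from the same initial section, the continued frequency (m + m')/(n + n') can be driven toward any target value.

```python
from fractions import Fraction


def steer(m: int, n: int, target: Fraction, extra: int) -> Fraction:
    """Continue a sequence with m hits among n elements (n >= 1) by `extra`
    further elements, each chosen so that the running frequency moves toward
    `target`. Returns (m + m') / (n + n') after the continuation."""
    hits, total = m, n
    for _ in range(extra):
        # Add a hit exactly when the current frequency is below the target.
        if Fraction(hits, total) < target:
            hits += 1
        total += 1
    return Fraction(hits, total)


# The same initial section (90 hits among 100) continued toward three targets.
for target in (Fraction(1, 5), Fraction(1, 2), Fraction(9, 10)):
    print(target, float(steer(90, 100, target, extra=10_000)))
```

Once the running frequency has crossed the target, this rule keeps it within 1/(n + n') of the target, so the deviation shrinks as the continuation grows, whatever the initial values m and n were.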

This result makes questionable the meaning of the limit statement for 
infinite sequences of the extensional kind. Assuming the verifiability theory 
of meaning, 3 we demand that if a statement is to be regarded as meaningful 
it should be possible to verify it as either true or false (the word “verify” is 
used here in the neutral sense, denoting a determination as true or as false). 
Since such verification is not physically possible for extensionally given se¬ 
quences of an infinite length, we must inquire in what sense the limit state¬ 
ment can be maintained for them, or what other change of interpretation 
could be envisaged. Before entering into such an analysis, we shall inquire 
whether the interpretation in terms of a limit is indispensable for a frequency 
interpretation. 


§ 65. The Necessity of the Concept of the Limit for the 
Frequency Interpretation 

In order to find out whether the limit statement can be dispensed with in 
a frequency interpretation, some other possible frequency interpretations will 
be considered. According to the interpretation by a partial limit as introduced
in (1, § 64), the statement of a probability p would require that it 
should always recur that the relative frequency approaches p to any desired 
degree of exactness, whereas no statement about a permanence of the con¬ 
vergence would be added. Because of the rather weak nature of such a postu¬ 
late, I attempted, in an earlier publication, 1 to develop this interpretation, 
but in the meantime I found such serious arguments against it that I aban¬ 
doned the idea. 

The first argument is that even statement (1, § 64) is completely unverifiable
and therefore cannot relieve the difficulty under discussion. Another 

3 See H. Reichenbach, Experience and Prediction (Chicago, 1938), §§ 4-8. Hereafter re¬ 
ferred to as EP. 

1 “Der Begriff der Wahrscheinlichkeit für die mathematische Darstellung der Wirklichkeit” 
(Diss. Erlangen, 1915), pp. 70-71, and in Zs. f. Philos. u. philos. Kritik, Vol. 162 (1917), 
pp. 246-247. The decision to attempt this interpretation was supported, also, by an objec¬ 
tion to the limit interpretation, which will be discussed presently (p. 346), since the objection 
becomes inapplicable for the case of the partial limit. I no longer regard the objection as 
tenable, however; so this argument for the interpretation by the partial limit is eliminated, 
too. 
argument results from the fact that a statement about a permanence from a 
certain element on, as employed in the definition of a limit, can be conceived 
logically as the negation of another statement about a partial limit (see § 64). 
Thus we are compelled to admit a statement of the form (2, § 64) as meaningful
if a statement of the form (1, § 64) is accepted; otherwise we should 
exclude the operation of the negation. In other words, since (4a, § 64) is 
equivalent to (2, § 64), the statement about the limit can be replaced by 
another statement containing only the concept of the partial limit and the 
negation. Therefore, if we admit the concept of the partial limit, we must 
also admit the concept of the limit unless we intend to renounce the opera¬ 
tion of negation. This is the reason why the interpretation by a partial limit 
does not carry any logical advantage. 

We shall now turn to a second attempt to reach a frequency interpretation 
of a weaker form. We shall construct an interpretation by a frequency state¬ 
ment that is at least unilaterally verifiable. The following axiom of interpre¬ 
tation is introduced: 

Axiom of interpretation a. If an event C is to be expected in a sequence 
with a probability converging toward 1, it will occur at least once in the sequence. 

This is a rather modest assumption, compared with the interpretation in 
terms of the limit. That it is of the type of statements that are unilaterally 
verifiable follows because it requires the occurrence of only one event for its 
verification. The interpretation so constructed would therefore seem logically 
preferable to that of the limit. It can be shown, however, that if the interpre¬ 
tation through the axiom a is assumed, the existence of a limit is deducible 
for normal sequences. We are therefore led back to the original interpretation. 

The proof is based on the amplified Bernoulli theorem (§ 51), which is
derivable in the formal calculus of probability, without the use of the
frequency interpretation. We understand by C the property that $f^n$ is the place
of convergence for δ. Then $c_{n\delta}$ is the probability of C; since the $c_{n\delta}$ converge
toward 1, C must occur once. If C, however, has occurred for $f^n$, C holds
also for all the following elements according to its definition. That is, $f^n$ and
at the same time all the following $f^i$ are situated within p ± δ; this is the
assertion of the limit. 
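
The notion of a place of convergence for δ can be illustrated on a simulated sequence. The Python sketch below is an illustration under assumptions, not part of the proof: independent trials with a fixed probability p stand in for a normal sequence, and the function name `last_exit` and its parameters are hypothetical. Within a finite horizon one can only record the last place at which the relative frequency lay outside p ± δ; whether it leaves the interval again beyond the horizon is not decided by the simulation.

```python
import random


def last_exit(p: float, delta: float, horizon: int, seed: int) -> int:
    """Simulate `horizon` independent trials with success probability p and
    return the last n <= horizon at which the relative frequency lies
    outside p ± delta (0 if it never does)."""
    rng = random.Random(seed)  # seeded so that the run is reproducible
    hits = 0
    last = 0
    for n in range(1, horizon + 1):
        hits += rng.random() < p
        if abs(hits / n - p) > delta:
            last = n
    return last


# Candidate place of convergence within the horizon: one step after the last exit.
print(last_exit(0.5, 0.05, horizon=10_000, seed=1) + 1)
```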

It is remarkable that such a weak assumption as the axiom of interpreta¬ 
tion a has so far-reaching a consequence. The result shows that the rejection 
of the interpretation in terms of the limit, for normal sequences, leads neces¬ 
sarily to the rejection of the axiom of interpretation a, that is, to the asser¬ 
tion that it is possible that an event that is expected with a probability con¬ 
verging toward 1 will never occur. It seems clear that whoever admits this 
possibility must abandon every attempt at a frequency interpretation. It is 
the aim of a frequency interpretation to reduce a probability statement to a 
certainty statement about frequencies; yet a statement demanding less than 
the axiom a is hardly conceivable. This axiom connects the modality of certainty
with the probability 1 by a minimum of requirements. It does not 
state that the probability 1 is identical with certainty. That would be asking 
too much. It states only that an event that is to be expected in the limit with 
the probability 1 will occur at least a single time in the sequence. This as¬ 
sumption, which, taken by itself, would admit even the value 0 for the limit 
of the relative frequency in regard to such an event, leads, in connection 
with axioms of the formal calculus of probability, to the interpretation in 
terms of the limit. 

Other assumptions, for instance, the interpretation by a partial limit, can 
likewise be shown to lead to this consequence. I therefore regard this proof 
as sufficient evidence that the interpretation of probability in terms of the 
limit of the frequency cannot be dispensed with. The proof makes evident 
the great importance of the theorem of Bernoulli: although taken alone this 
theorem cannot be used for a derivation of the frequency interpretation of 
probability, it leads to the interpretation in terms of a limit of the frequency 
as soon as it is combined with an assumption as weak as the axiom a. 

In this connection I shall deal with an objection that was raised against 
the limit interpretation. 2 The attempt was made to construct a contradiction
between the limit interpretation and Bernoulli’s theorem by the following
consideration. If the sequence of the frequency has a limit p, there must exist
an n for a given δ such that the frequency beyond $f^n$ does not deviate from p
by more than δ. From this fact it follows, however, that the elements succeeding
the element $y_n$ are in some way restricted. Thus a run of elements B beginning
at the place $y_n$ cannot surpass a certain length r. According to Bernoulli’s
theorem, on the contrary, a certain probability greater than 0 must be allowed
for a run of s elements B (s > r). Thus it seems that at the place $y_n$ a section 
having a certain probability greater than 0 must be regarded as impossible 
when the limit interpretation is used. 

The contradiction is solved as follows. We must distinguish between the
probability that a run of s elements B will occur and the probability that a
run of s elements B will occur at the place $y_n$. In the frequency interpretation
the first probability refers to sections all of which are situated within the same
sequence. It is quite possible that such runs occur beyond the element $y_n$
without disturbing the convergence, since for greater n a run of the length s
will not cause a transgression of the convergence limit δ. The second
probability, however, refers to lattices of sequences, since here the place $y_n$ is
specified; for its interpretation we must use a frequency in the vertical
direction. Now the following property holds for a lattice of normal sequences in
the narrower sense: if all horizontal sections extending from $y_n$ to $y_{n+s}$ are
counted in the vertical direction, a certain number of sections containing only 
2 W. Sternberg, in Zs. f. angew. Math. u. Mech., Vol. IX (1929), p. 501. 
elements B will be found among them, in correspondence with the Bernoulli
theorem. This does not contradict the limit interpretation, because not all
the horizontal sequences will have their places of convergence for δ at the
same n. The lattice is not required to exhibit a uniform convergence: in each
sequence there is an n for a given δ such that $f^n$ is the place of convergence
for δ, but there does not exist one such n for all sequences. With this solution,
which was first offered by von Mises, the contradiction disappears. 
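
The absence of uniform convergence in a lattice can be illustrated by simulating several horizontal sequences side by side. This is a Python sketch under assumptions: the rows are independent trials with a common probability p, the horizon is finite, and the name `places_of_convergence` is hypothetical. For each row it records the first n after which the relative frequency stays within p ± δ up to the horizon; in a typical run these places differ from row to row.

```python
import random
from typing import List


def places_of_convergence(rows: int, p: float, delta: float,
                          horizon: int, seed: int = 0) -> List[int]:
    """For each row of a simulated lattice, return the first n after which
    the relative frequency stays within p ± delta up to `horizon`."""
    rng = random.Random(seed)  # one seeded generator for the whole lattice
    places = []
    for _ in range(rows):
        hits, last_outside = 0, 0
        for n in range(1, horizon + 1):
            hits += rng.random() < p
            if abs(hits / n - p) > delta:
                last_outside = n
        places.append(last_outside + 1)
    return places


places = places_of_convergence(rows=10, p=0.5, delta=0.05, horizon=5_000)
print(places)                    # one place per horizontal sequence
print(min(places), max(places))  # the spread across the lattice
```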

With these results we have returned to the limit interpretation. Its neces¬ 
sity is indisputable if a frequency interpretation is to be adopted. We must 
therefore find a way to eliminate the logical difficulties of the limit interpretation
presented in § 64.


§ 66. The Meaning of Limit Statements 

We saw the logical difficulties in the fact that the statement about the limit 
appears meaningless for extensionally given sequences of an infinite length.
A limit at a given value p is compatible with every finite beginning of the 
probability sequence; since we can count the frequency only in a finite initial 
section, all limit statements must be called nonverifiable and, consequently, 
meaningless. 

The analysis of the problem of meaning with respect to sequences of physical 
events has suffered from the fact that it has been too closely attached to the 
mathematical formulation of the calculus of probability. Only in the mathe¬ 
matical version do we find infinite sequences; the sequences of actual statistics, 
however, are always finite. They are so, not only with respect to the initial
section for which the statistics were compiled, but also with respect to the 
section of the sequence that lies in the future and is not yet observed. In fact, 
we are interested only in finite sequences because they will exhaust all the 
possible observations of a human lifetime or the lifetime of the human race. 
We wish to find sequences that behave, in a finite length of these dimensions, 
in a way comparable to a mathematical limit, that is, converging sufficiently 
well within that length and remaining within the interval of convergence. If a 
sequence of roulette results or of mortality statistics were to show a noticeable 
convergence only after billions of elements, we could not use it for the appli¬ 
cation of probability concepts, since its domain of convergence would be 
inaccessible to human experience. However, should one of the sequences 
converge “reasonably” within the domain accessible to human observation 
and diverge for all its infinite rest, such divergence would not disturb us; we 
should find that such a semiconvergent sequence satisfies sufficiently all the rules 
of probability. I will introduce the term practical limit for sequences that, in 
dimensions accessible to human observation, converge sufficiently and remain 
within the interval of convergence. Sequences of this kind will include only 
a subclass of all sequences having a limit and will include also a subclass
consisting of semiconvergent sequences. It is with sequences having a practical
limit that all actual statistics are concerned.
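
The notion of a practical limit may be illustrated by a minimal sketch. The sequence, the interval `delta`, the place of convergence `n_conv`, and the accessible length `horizon` are all assumptions chosen for illustration; none of them is taken from the text.

```python
def running_frequencies(sequence):
    """Relative frequency of the attribute (coded as 1) after each element."""
    count = 0
    freqs = []
    for i, x in enumerate(sequence, start=1):
        count += x
        freqs.append(count / i)
    return freqs


def has_practical_limit(sequence, p, delta, n_conv, horizon):
    """True if, from element n_conv up to element horizon, the running
    frequency stays inside the interval p +/- delta."""
    freqs = running_frequencies(sequence[:horizon])
    return all(abs(f - p) <= delta for f in freqs[n_conv - 1:])


# Hypothetical semiconvergent sequence: well behaved for the first 1000
# elements (the assumed "accessible" domain), then drifting away from 1/2.
accessible = [1, 0] * 500
inaccessible = [1] * 4000
sequence = accessible + inaccessible

within_reach = has_practical_limit(sequence, p=0.5, delta=0.05, n_conv=20, horizon=1000)
beyond_reach = has_practical_limit(sequence, p=0.5, delta=0.05, n_conv=20, horizon=5000)
```

Under these assumptions the sequence has the practical limit 1/2 within the first 1000 elements, although the frequency leaves the interval when the later elements are counted as well.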

In applying the rules of probability to such sequences, we shall find that 
theorems derived from the axioms of the elementary calculus (axioms I-IV,
§§ 12-14) hold strictly, since these axioms are strictly valid for finite sequences
(see § 18). Theorems derived in the theory of order, however, will be found
to hold only approximately, since the axioms V, § 28, hold strictly only for
infinite sequences. The conditions of types of order will be regarded as satis¬ 
fied if they hold to a certain degree of exactness, and the inexactness will 
become worse for subsequences when the number of elements becomes smaller. 
For a random sequence, the frequency of groups of successive elements will 
follow the Bernoulli theorem only if the group is not too long, whereas for long 
groups the deviations between calculated and observed frequency may surpass 
any degree. 
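
One reason why long groups cannot follow the calculated frequency in a finite sequence can be sketched numerically. The argument used here is an assumption of this illustration, not a statement of the text: among the overlapping groups of length k in n elements, the smallest nonzero frequency that can be counted is 1/(n - k + 1), whereas the calculated frequency of a group of k elements B is p^k. The values of `n`, `k`, and `p` are hypothetical.

```python
def calculated_group_frequency(p, k):
    """Calculated frequency of a group of k successive elements B
    when the probability of B is p and the sequence is random."""
    return p ** k


def smallest_nonzero_observed_frequency(n, k):
    """Smallest nonzero relative frequency that can be counted among the
    overlapping groups of length k in a sequence of n elements."""
    windows = n - k + 1
    return 1 / windows


def deviation_factor(n, k, p):
    """Ratio of the smallest nonzero observable frequency to the calculated
    one. A value far above 1 means the observed frequency is either 0 or
    many times larger than the calculated frequency."""
    return smallest_nonzero_observed_frequency(n, k) / calculated_group_frequency(p, k)


short_group = deviation_factor(n=10_000, k=2, p=0.5)
long_group = deviation_factor(n=10_000, k=30, p=0.5)
```

For the short group the factor is far below 1, so the count can approximate the calculated value; for the long group it exceeds 100000, so no count in a sequence of this length can come close to it.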

It would be possible to transform the calculus of probability so that it 
takes account of finite sequences. We should then speak, not of a limit of the 
frequency, but of a limiting interval $\epsilon$ and should have to stipulate, in every
theorem referring to subsequences, which larger interval of convergence $\delta$
we are willing to admit. A calculus of this kind would be rather complicated
and cumbersome. In comparison, the calculus in its present form must be
regarded as an idealization, with the advantage, however, that it is much
simpler and easier to handle. This is reason enough to prefer it. In this respect 
we follow the practice of surveyors who apply the ideal notions of geometry 
to physical elements that satisfy the axioms of geometry only to a certain 
extent. With respect to all applications, in fact, the notions of a line without 
width and a point without spatial extension must be regarded as idealizations 
that are never realized by physical objects. The practical geometry of lines 
of small width and points of small extension can be identified with the ideal 
geometry to a certain degree of approximation. We deal with the latter in all 
computations because its mathematical structure is much simpler. For the 
same reason we shall not use the concept of a practical limit in the following 
chapter, but shall carry through all technical analysis with respect to infinite 
sequences. We know that the results will hold approximately for practical 
limits. 

These considerations will settle the problem of meaning for probability 
sequences of physical events. To those who insist that every meaningful 
statement should be strictly verifiable in a finite number of observations we 
answer that all sequences with which we are concerned are sequences of a 
practical limit, and thus a finitization of the limit statement can be carried 
through. There can be no doubt that this finitization will satisfy all require¬ 
ments of the verifiability theory of meaning in its most rigorous form. 
The situation is different with extensionally given sequences when an
infinite length is admitted, as in the purely mathematical conception of the 
calculus of probability. If infinite sequences of the extensional kind are used 
in the sense of an idealization, statements referring to such sequences must 
be meaningful in some sense, or it would not be possible to translate them 
into approximative statements holding for finite sequences. An interpreted 
calculus of probability that deals with infinite sequences not given by rules, 
for example, the random sequences of von Mises and Church (see § 30), must 
be classified along with certain chapters of the theory of sets that likewise 
do not satisfy the requirements of the verifiability theory of meaning in the 
form assumed by physicists. Thus the axiom of choice maintains the possibility 
of a selection from an infinite class of classes, though in general we have no 
means of defining the selection by a rule that is statable in a finite number 
of terms. Since the present work is concerned primarily with the probability 
calculus of physical events, an analysis of these logical questions is beyond 
its province. I may add, however, that a solution seems possible when the 
meaning of such postulates as the axiom of choice, or of limit statements 
about infinite sequences not given by a rule, is regarded as being of a type 
that I have called logical meaning. 1 For this type of meaning the logical 
possibility of verification, as distinct from the physical possibility, is regarded 
as sufficient. A fact is called logically possible if it is no contradiction to assume 
that it happens, though the occurrence may be excluded by physical laws. 2 
It is called physically possible if there are no physical laws excluding it. Thus 
an increase of energy in a closed system is logically, though not physically, 
possible, whereas interplanetary travel is not only logically, but also physi¬ 
cally, possible. The enumeration of an infinite sequence can be called only 
physically impossible; it is not a logical impossibility. In other words, it is 
no contradiction to assume that an infinite sequence is known element by 
element, though such knowledge is not physically possible for a human 
organism. The ultimate answer to all such questions, of course, is closely 
connected with the problem of a proof of consistency in Hilbert’s sense. 

If the proposal of logical meaning for statements as described is not ac¬ 
cepted, it may be possible to construct a meaning for them on the basis of the 
fact that they are translatable into approximative statements for finite sequences 
or classes. Such a meaning, at least, is all that is needed for the theory of 
probability so far as it is applied to physical events. It may then be called 
a fictitious meaning. For the purpose of these investigations, however, such 
questions need not be discussed. In its physical applications the frequency 
interpretation of probability can be given a finitist meaning, satisfying the 
verifiability theory. 

1 See EP. §§ 6-8. 

2 The definition of physical laws, or nomological statements, is given in ESL, chap. viii.

§ 67. The Assertability of Probability Statements in the
Frequency Interpretation

We turn now to the problem of the assertability of probability statements, 
still restricting ourselves to the frequency interpretation of such statements. 
Two questions must be answered. The first concerns the assertability of prob¬ 
ability laws, that is, theorems of the calculus and thus of formulas that do not 
include assertions of specific numerical values of probability. The second 
concerns the assertability of numerical degrees of probability and therefore 
may be formulated as the question of the ascertainment of the degree of prob¬ 
ability. 

The question of the assertability of probability laws is easy to answer. 
It was shown in § 18 that axioms of the calculus of probability follow tautologically
from the frequency interpretation. Therefore, if probability is
regarded as meaning the limit of a relative frequency, the validity of prob¬ 
ability laws is guaranteed by deductive logic. This result is of major importance 
for the epistemological critique of probability statements. So far as theorems 
of the calculus of probability are concerned, the application of probability 
statements to physical objects offers no greater problems than that of theorems 
of deductive logic. In other words, with respect to theorems of the calculus 
there is no problem of application that is specific for probability. Now the 
assertability of theorems of deductive logic is based on the fact that logical 
formulas are empty, that they do not anticipate properties of the physical 
world. In the same sense, therefore, theorems of probability must be regarded
as empty when applied to relative frequencies. The theorems constitute mere 
transformations of probability expressions into others, without any addition 
as to content. The very emptiness, however, supplies the reason for the use¬ 
fulness of probability laws. If the laws were not empty it would not be per¬ 
missible to add them to given premises, since only emptiness of the laws can 
guarantee that the results of transformations do not state more than is 
supported by the premises. That, nonetheless, the conclusions of probability 
transformations can be new, in the psychological sense, is obvious to the 
logician who is familiar with the twofold nature of tautologies as logically 
empty but full of psychological content. 

In this context the construction of the theory of order assumes particular 
importance. We saw that all special types of probability sequences can be 
characterized by postulates stating the equality of probabilities in certain 
subsequences. The treatment of any sequence type, therefore, requires no 
addition to probability axioms other than statements about the equality of 
degrees of probability. It follows that the verification of statements classifying 
a sequence with respect to its type of order can be achieved as soon as methods 
for the determination of numerical probability values are known. The analysis
of such a classification thus finds its place in the discussion of the ascertain¬ 
ment of the degree of probability. We see why a solution of the problem of 
the assertability of probability laws cannot be given until a general theory 
of the order of probability sequences is constructed. Without such a theory 
we should have no proof that all probability laws are tautologous when the 
frequency interpretation is assumed; we could not exclude the possibility 
that some unknown synthetic law concerning the type of order is implicitly 
involved. 

The problem of the ascertainment of the degree of probability, in contra¬ 
distinction to the problem of probability laws, offers serious difficulties. They 
originate from the difficulties in the verification of the limit statement explained 
above (§ 64). 

In extensionally given sequences (and probability sequences referring to
physical reality are of this type) verification of the limit statement is impossible
if the sequences are infinite. It has been emphasized that sequences of
physical applications are not infinite and that verification by enumeration is 
possible in principle. Although this result is sufficient to settle the problem 
of the meaning of the limit statement, it does not help so far as the assertability 
of the limit statement is concerned. The reason is that in all practical applica¬ 
tions we wish to know the value of the limit before the sequence is completely 
produced; indeed, all practical use of probability statements consists in the 
fact that they are applied for the prediction of relative frequencies, and thus 
cannot be based on counting the total sequence. It is the predictive nature
of probability statements, therefore, that presents difficulties to a proof that 
such statements are assertable. 

This consideration makes clear that the ascertainment of the degree of 
probability must be achieved by methods other than counting the relative 
frequency in the total sequence. In practical applications, two different 
methods are used. The first is called the a posteriori determination of prob¬ 
abilities; the second, the a priori determination of probabilities. 

The a posteriori determination is identical with a procedure known in logic 
as induction by enumeration, if the term is applied in a somewhat wider sense
than in traditional logic. It is based on counting the relative frequency in an 
initial section of the sequence, and consists in the inference that the relative 
frequency observed will persist approximately for the rest of the sequence; 
or, in other words, that the observed value represents, within certain limits of 
exactness, the value of the limit for the whole sequence. This inference is
called the inductive inference; its formulation is called the rule of induction. 
A strict formulation of the rule will be given later (§ 87). The inductive 
inference of traditional logic is of a somewhat narrower form; it refers to 
initial sections in which all elements have the same attribute B and assumes 
the permanence of B for the rest of the sequence. This inference, which may
be called classical induction, is a special case of the general form, namely, the
case where the observed relative frequency is = 1. To distinguish the general
rule of induction from the special case of classical induction, it may be called
statistical induction.
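
The a posteriori determination can be sketched as follows. This is only an informal illustration and not the strict formulation of the rule announced for § 87; the function names and the width `delta` of the interval of exactness are assumptions introduced here.

```python
from fractions import Fraction


def observed_frequency(initial_section, attribute="B"):
    """Relative frequency m/n of the attribute in the initial section."""
    m = sum(1 for x in initial_section if x == attribute)
    n = len(initial_section)
    return Fraction(m, n)


def statistical_induction(initial_section, delta, attribute="B"):
    """Posit that the limit of the frequency lies within the observed
    value +/- delta (clipped to the range from 0 to 1)."""
    f = observed_frequency(initial_section, attribute)
    return (max(Fraction(0), f - delta), min(Fraction(1), f + delta))


# Statistical induction: observed frequency 3/5, posited interval 1/2 to 7/10.
mixed = list("BABBA")
posit = statistical_induction(mixed, Fraction(1, 10))

# Classical induction as the special case of observed frequency 1.
uniform = list("BBBB")
classical = observed_frequency(uniform)
```

The sketch records what is inferred; it offers no justification of the inference.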

The problematic character of the inductive inference has often been discussed.
In § 64 it was pointed out that a relative frequency $\frac{m}{n}$ observed in an
initial section is compatible with any value p for the limit of the relative
frequency in the whole infinite series. For sequences of a finite length the
frequency $\frac{m}{n}$ will impose certain restrictions on the limit; but if the sequence
is long enough (and that is true of practical applications) the restrictions
are negligible and the value of the limit is virtually independent of the observed
value $\frac{m}{n}$. The inductive inference, therefore, is essentially different
from a deductive inference; it carries no logical necessity with it. For the
present, however, discussion of the logical nature of the inference will be
postponed.
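
That an observed initial frequency leaves the later course of the sequence open can be made concrete by a small construction. The steering rule used below is a device assumed for this illustration only: the same initial section is continued once toward the value 1/4 and once toward the value 9/10.

```python
from fractions import Fraction


def continue_toward(initial_section, p, extra):
    """Extend a 0/1 initial section by `extra` elements so that the running
    frequency of 1s is steered toward the value p."""
    seq = list(initial_section)
    count = sum(seq)
    for _ in range(extra):
        # Append a 1 only while the running frequency lies below p.
        nxt = 1 if Fraction(count, len(seq)) < p else 0
        seq.append(nxt)
        count += nxt
    return seq


def frequency(seq):
    """Relative frequency of 1s in the whole sequence."""
    return Fraction(sum(seq), len(seq))


initial = [1] * 10  # observed frequency m/n = 1 in the initial section
low = continue_toward(initial, Fraction(1, 4), 10_000)
high = continue_toward(initial, Fraction(9, 10), 10_000)
```

Both continuations share the same first ten elements, yet their frequencies after 10010 elements lie close to 1/4 and 9/10 respectively.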

It would be a simple way out of the difficulty if it could be shown that other 
methods than the inductive rule are available for the ascertainment of a degree 
of probability. Therefore we must now inquire whether an a priori determina¬ 
tion of a degree of probability is possible. 

§ 68. The So-Called A Priori Determination of the 
Degree of Probability 

The a priori determination of the degree of probability occupies a special 
position in the historical development of the philosophy of probability. The 
determination of probabilities on the ground of properties of symmetry, 
without the use of frequencies, has been regarded by some authors as the 
nucleus of the problem of probability. It has been contended that every other 
determination of probabilities must be reduced to the same logical schema. 
In the pursuit of this idea it was believed that all attempts to solve the 
philosophical problems of the concept of probability should start at this point. 

Mechanisms like the die and the roulette wheel are characterized by the 
existence of a complete and exclusive disjunction the terms of which are 
equiprobable. The theorem of addition permits us to infer, from the number r
of the terms of the disjunction, that the probability is = $\frac{1}{r}$ for every single
term. With respect to such disjunctions it is therefore possible to reduce
the determination of the degree of probability to the determination of the 
concept equiprobable. On this fact was based the well-known definition of
probability as the ratio of the favorable to the possible cases. 

The idea led to the conception that the starting point of every determina¬ 
tion of probabilities must be a disjunction consisting of equiprobable terms. 
Nonequiprobable terms of a disjunction, according to this conception, must 
be reduced to another disjunction consisting of equiprobable terms, from 
which the nonequiprobable terms are derived by transformations. The reduc¬ 
tion may be illustrated by the computation of the probability of obtaining a 
number greater than 5 by throwing two dice: here the two nonequiprobable 
cases of the alternative “greater than 5” and “not greater than 5” are reducible 
to the original equiprobable cases given by the faces of the die. 
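
The reduction can be carried out by enumeration. The sketch assumes that "a number greater than 5" refers to the sum of the two faces, and that the 36 ordered pairs of faces are the equiprobable cases; both readings are assumptions of this illustration.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
cases = list(product(faces, repeat=2))  # the 36 equiprobable cases

# Favorable cases: the sum of the two faces is greater than 5.
favorable = [c for c in cases if sum(c) > 5]

# Ratio of the favorable to the possible cases.
p_greater = Fraction(len(favorable), len(cases))
p_not_greater = 1 - p_greater
```

The two terms of the alternative obtain the unequal probabilities 13/18 and 5/18, although each is composed of cases of the probability 1/36.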

It is true that in some cases such a reduction is possible as well as justified. 
Thus the greater probability of male births as compared with female births 
may be explained by the fact that in the surroundings of the ovum the number 
of spermatozoons producing the male sex somewhat surpasses the number of
spermatozoons producing the female sex, and the probability of fertilization
is assumed to be equal for both kinds of spermatozoons. It is not true, however, 
that reduction of the degree of probability to equiprobability is always pos¬ 
sible. Theories like the Spielraumtheorie of Johannes von Kries, 1 which regard 
this idea as the solution of the problem of probability, must therefore be 
rejected as untenable. 

A more important objection must be added: even if the degree of prob¬ 
ability can be reduced to equiprobability, the problem of probability is only 
shifted to this concept. All the difficulties of the so-called a priori determina¬ 
tion of probability, therefore, center around this issue. 

In the search for an argument by which the equiprobability can be derived, 
a principle was established that must be regarded as the foundation of all 
a priori determination of probability: the principle of indifference, also called
the principle of no reason to the contrary. It maintains that events are equi¬ 
probable when there is no reason to assume that one should occur rather than 
another. Thus we have no reason, it is argued, to favor one of the faces of 
the die; therefore they are equiprobable. Some authors present the argument 
in a disguise provided by the concept of equipossibility: cases that satisfy
the principle of “no reason to the contrary” are said to be equipossible and 
therefore equiprobable. This addition certainly does not improve the argu¬ 
ment, even if it originates with a mathematician as eminent as Laplace, 2 
since it obviously represents a vicious circle. Equipossible is equivalent to 
equiprobable. It appears advisable to discuss the principle of indifference in 
a form that avoids this fallacy, and ask whether the absence of a reason to 
the contrary can guarantee equal probabilities. 

1 Die Principien der Wahrscheinlichkeitsrechnung (Freiburg i. B., 1886). 

2 Essai philosophique sur les probabilités (Paris, 1814; rev. ed., Gauthier-Villars, 1921, p. 9).

At first sight the principle seems plausible, since we actually use such
inferences in the determination of probabilities. The inference is often asso¬ 
ciated with a feeling of self-evidence that makes the conclusion seem “logical”. 
But the fact that we use and believe in inferences of this type does not prove 
that they are valid. On the contrary, a brief reflection on the content of the 
principle shows that it is invalid. The absence of a reason to the contrary 
is a condition of our knowledge; equiprobability is a condition holding for 
physical objects. Why should the occurrence of physical events follow the 
directive of human ignorance? Perhaps we have no reason to prefer one face 
of the die to the other; but then we have no reason, either, to assume that 
the faces are equally probable. To transform the absence of a reason into a 
positive reason represents a feat of oratorical art that is worthy of an attorney 
for the defense but is not permissible in the court of logic. 

That the principle leads to correct results for mechanisms like the die 
must be explained by the fact that in such mechanisms more conditions are 
realized than are formulated in the principle. The principle represents an 
instance of the fallacy of incomplete schematization (§ 21); it does not state
all the conditions that must be satisfied for equiprobability. It leads to cor¬ 
rect results only where the additional conditions are satisfied; but general 
usage leads to absurd conclusions. Thus it would follow that every event 
about the occurrence of which we know nothing is to be expected with the 
probability $\frac{1}{2}$, an inference that is not improved by the fact that some philosophers
maintain the right to use it. Or consider the probability of male
births: so long as nothing was known about the mechanism of fertilization,
the probability was assumed to be = $\frac{1}{2}$, since there was no reason to expect
a male birth rather than a female birth—a result that is now known to be 
false. If it is argued that there were reasons to expect the contrary, derived 
from the statistical prevalence of male births, we must answer, first, that the 
argument represents the use of an inductive inference rather than that of 
the principle; and second, that if the inductive inference is admitted it will 
falsify the conclusion that results when the principle is applied without 
knowledge of birth statistics. 

The latter criticism indicates the weakest spot in the a priori theory of 
probability: if both an a posteriori and an a priori determination of the degree 
of probability are admitted, their results may contradict each other. The 
only possible defense of the a priori determination, therefore, consists in the 
attempt to restrict it to a meaning of probability that is not expressible in 
terms of frequencies and is not verifiable by the inductive rule. The discussion 
of this attempt will be given in § 71. For the time being, the result may be 
stated as follows: the principle of indifference is not admissible when the 
frequency interpretation of probability is assumed. 

In § 69 it will be shown that the equiprobability for mechanisms like the 
die is derivable from certain of their physical properties, among which
geometrical symmetry plays an important part. The invalid inference can
be replaced, in such applications, by a correct inference. Like other fallacies, 
the principle owes its apparent plausibility to the fact that in many instances 
it can be applied successfully, though the correct explanation of such instances 
is rather involved. 

The absurdity of the principle was demonstrated, in particular, for appli¬ 
cations to geometrical probabilities, where it leads to the conclusion that, if 
nothing is known to the contrary, equal areas obtain equal probabilities. 
This consequence, however, entails contradictions for transitions from one 
attribute space to another when the two spaces are coupled by a nonlinear
measure transformation. To which of the two spaces should the principle 
be applied? If equal areas possess equal probabilities in one space, they 
cannot do so in the other. 

For instance, assume it is known only that the specific weight of an unknown
substance lies between 4 and 6. The principle then leads to the consequence
that there is a probability of $\frac{1}{2}$ that the specific weight lies between
4 and 5, the same holding for the interval from 5 to 6. Applying the same
inferences to the determination of the specific volume, which is the reciprocal
value of the specific weight, we arrive at the result that the specific volume
is to be expected with a probability greater than $\frac{1}{2}$ in the interval from $\frac{1}{4}$ to $\frac{1}{5}$,
since the difference $\frac{1}{4} - \frac{1}{5}$ is greater than the difference $\frac{1}{5} - \frac{1}{6}$. This result
contradicts the previous one.
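
The arithmetic of the example can be checked with exact fractions. The sketch assumes that the specific weight is known to lie between 4 and 6, so that the specific volume lies between 1/6 and 1/4, and it applies the principle as a uniform measure in each attribute space in turn; the helper function is introduced only for this illustration.

```python
from fractions import Fraction


def uniform_probability(lo, hi, a, b):
    """Probability of the interval from a to b under a uniform measure
    on the interval from lo to hi."""
    return (b - a) / (hi - lo)


# Principle applied to the specific weight, assumed to lie between 4 and 6.
w_lo, w_hi = Fraction(4), Fraction(6)
p_weight = uniform_probability(w_lo, w_hi, Fraction(4), Fraction(5))

# Principle applied to the specific volume, the reciprocal of the weight.
# The weights from 4 to 5 correspond to the volumes from 1/5 to 1/4.
v_lo, v_hi = 1 / w_hi, 1 / w_lo
p_volume = uniform_probability(v_lo, v_hi, Fraction(1, 5), Fraction(1, 4))
```

The same physical interval obtains the probability 1/2 in the one space and 3/5 in the other.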

It should be noticed that the difficulty is inherent in the principle and is
not introduced artificially. In fact, if nothing is known about a preference
of certain areas in one attribute space, the same must hold for the other,
and the contradiction is unavoidable. If the contradiction is to be eliminated,
the principle would have to be supplemented by a rule selecting the attribute
space to which it is to be applied. No such rule has ever been suggested, and 
it is hard to see how it could be formulated. 


§ 69. Explanation of the Equiprobability in Mechanisms 
of Games of Chance 

The explanation of the equiprobability in mechanisms of games of chance 
can be given by the use of an idea that was first developed by Henri Poincaré. 1

1 Calcul des probabilités (Paris, 1912), p. 149. I used the idea in a demonstration showing
that the application of the calculus of probability contains no assumptions different from
those presupposed in the application of the principle of causality: “Der Begriff der Wahrscheinlichkeit
für die mathematische Darstellung der Wirklichkeit” (Diss. Erlangen, 1915),
and in Zs. f. Philos. u. philos. Kritik, Vol. 161 (1916), p. 209; Vol. 162 (1916), pp. 98, 223.
Even though this first among my papers referring to the problem of probability was written
under the influence of Kant’s epistemology, it seems to me that the result concerning the
theory of probability can be stated independently of Kant’s doctrine and incorporated in
my present views. The present section represents such a restatement of the analysis given
in the paper.

The example of the roulette wheel offers itself for the exposition. Assume that
the wheel is played in such a manner that before each spinning the indicator 
is brought into the same initial position. The entire angle $\omega$ of the rotation
of the indicator may be counted in multiples of $2\pi$, that is, from the initial
position to the stop. The probability that a certain value $\omega$ will obtain is
determined by a probability function $\varphi(\omega)$ (see fig. 25). The following considerations
are based on the existence of such a function, whereas the actual
form of the function need not be known. 



Fig. 25. Probability function for roulette game, used
for demonstration of equiprobability of “red” and
“black”; $\omega$ is rotation angle of hand counted in multiples
of $2\pi$.


The division by black and red sectors supplies a division by intervals of 
equal width $\Delta\omega$ on the axis of the abscissa $\omega$. To simplify the problem, assume
that no sector is designated for the profit of the bank; so there are black and 
red sectors only. Every second interval represents the result “red”; the other 
intervals lead to the result “black”. The corresponding probability is given 
by the respective stripe of ordinates. In figure 25 one of the sets of stripes is 
shaded; it may represent the result “black”. The probability of obtaining 
“black” will then be represented by the total area covered by the shaded 
stripes; the probability of obtaining “red” is given by the area of the other 
stripes. 

Now it is easy to show that the sum of the shaded stripes almost equals
the sum of the unshaded stripes if the function $\varphi(\omega)$ does not oscillate too
much. The two sums of stripes will become equal in the limit $\Delta\omega = 0$ even
if the function $\varphi(\omega)$ is submitted to no other condition than that of continuity.

This theorem will be demonstrated first for a finite section extending from
$\omega_0$ to $\omega_n$. Consider two successive stripes as forming a pair. If $\varphi'$ is the smallest,
and $\varphi''$ the greatest, value of the ordinate $\varphi$ within the pair, then the difference
in the area of the two stripes is not larger than

$$(\varphi'' - \varphi') \cdot \Delta\omega \qquad (1)$$

The number of such pairs of stripes is $\frac{n}{2}$. In one of the $\frac{n}{2}$ pairs the difference
$(\varphi'' - \varphi')$ has its greatest value, which is called $(\varphi'' - \varphi')_{\max}$. Then the
difference between the sum of the shaded and the unshaded areas is not
larger than

$$\frac{n}{2} \cdot (\varphi'' - \varphi')_{\max} \cdot \Delta\omega \qquad (2)$$

With increasing n the value $(\varphi'' - \varphi')_{\max}$ approaches 0 because of the continuity
of $\varphi(\omega)$, whereas $n \cdot \Delta\omega$ remains always equal to the constant value
$\omega_n - \omega_0$. Consequently the expression (2) converges to 0. Thus the proof is
given.
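
The convergence asserted by the proof can be observed numerically. The sketch below assumes a particular probability function, namely one rising linearly on the section from 0 to 1; this choice, the midpoint rule used for the areas, and the sampling used to estimate the extreme ordinates within a pair are assumptions of the illustration and not part of the proof.

```python
def stripe_area(phi, a, b, steps=200):
    """Midpoint-rule approximation of the area under phi between a and b."""
    h = (b - a) / steps
    return sum(phi(a + (i + 0.5) * h) for i in range(steps)) * h


def shaded_unshaded_difference(phi, omega_0, omega_n, n):
    """Absolute difference between the summed areas of the even-numbered
    and the odd-numbered stripes, for n equal stripes (n even)."""
    d = (omega_n - omega_0) / n
    areas = [stripe_area(phi, omega_0 + j * d, omega_0 + (j + 1) * d) for j in range(n)]
    return abs(sum(areas[0::2]) - sum(areas[1::2]))


def expression_2(phi, omega_0, omega_n, n, samples=200):
    """The bound (n/2) * (phi'' - phi')_max * delta_omega, with the extreme
    ordinates inside each pair of stripes estimated by sampling."""
    d = (omega_n - omega_0) / n
    worst = 0.0
    for k in range(n // 2):
        left = omega_0 + 2 * k * d
        values = [phi(left + 2 * d * i / samples) for i in range(samples + 1)]
        worst = max(worst, max(values) - min(values))
    return (n / 2) * worst * d


def phi(omega):
    """Hypothetical continuous probability function on the section 0 to 1."""
    return 2.0 * omega


coarse = shaded_unshaded_difference(phi, 0.0, 1.0, 10)
fine = shaded_unshaded_difference(phi, 0.0, 1.0, 1000)
bound = expression_2(phi, 0.0, 1.0, 100)
```

For this function the difference of the two sums shrinks in proportion to the width of the stripes and remains below the bound for every even n.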

The proof can be extended to probability functions the abscissa of which
is not bound to the limits $\omega_0$ and $\omega_n$, running through all values from $-\infty$
to $+\infty$. Since the total area of all the stripes is finite because of (6, § 42),
it is possible to choose two limits $\omega_0$ and $\omega_n$ so that the main part of the area
between the curve and the axis of abscissas is situated between these values
and the remaining part of the area becomes smaller than a given number $\epsilon$.
The foregoing proof can then be given for the main part of the area. By
transition to the limit $\epsilon = 0$, the proof is extended to the whole area.

When this consideration is used for the explanation of the game of roulette, 
the following assumptions must be made: 

1. The probability of a rotation angle $\omega$ is determined by a continuous
probability function $\varphi(\omega)$.

2. The intervals $\Delta\omega$ are equal in size.

3. The size of the intervals $\Delta\omega$ is small with respect to the oscillation of the
function $\varphi(\omega)$.

The last assumption is necessary because in a practical case we never deal
with the limit $\Delta\omega = 0$.

The assumption 1 represents a very general assumption about the existence 
of probabilities; it does not contain any metrical presuppositions. 

The assumption 2 concerns metrical relations; it has no reference to prob¬ 
ability, however, since it refers to geometrical relations. In order to confirm 
it, we have only to measure the size of the sectors of the roulette wheel. 

The assumption 3 represents a certain rough appraisal of the metrical 
properties of a probability function that may, however, remain undetermined 
within wide limits. 

The equiprobability assumed for the game of roulette is easily derived 
from these three assumptions. But only two of the assumptions, the first 
and the third, express probability statements. The second has nothing to do 
with probability. Therefore, we have reduced the equiprobability assumed 
for the game of roulette to weaker probability assumptions, namely, to the 
assumptions 1 and 3. They state less than the assertion of equiprobability, 
since the equiprobability can be derived only if the second assumption is
added. If the second assumption is replaced by another, for instance, by the 
presupposition that the black sectors are twice the size of the red sectors, 
we can no longer infer the equiprobability of “black” and “red”, but shall 
rather derive a probability relation of 2 to 1. In this sense the theory supplies 
an explanation of the equiprobability. The latter has been derived from more 
general assumptions. 
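
The derivation can be checked numerically. The following is a minimal sketch, not part of the original argument: the function `density` is an assumed, arbitrarily chosen curve standing in for φ(w), and the names `prob_red`, `red_width`, and `black_width` are introduced here for illustration only. The stripes under the curve are colored alternately, and the share of the area falling on red stripes is computed.

```python
import math


def density(w):
    """Assumed, unnormalized density of the rotation angle w (illustrative only)."""
    return w * math.exp(-w / 20.0)


def prob_red(red_width, black_width, w_max=400.0, samples=20):
    """Share of the area under the curve that falls on red stripes.

    Stripes alternate red, black, red, ... along the w-axis, starting at w = 0.
    Each stripe is integrated with the midpoint rule.
    """
    area = {True: 0.0, False: 0.0}  # keyed by "stripe is red"
    w, is_red = 0.0, True
    while w < w_max:
        width = red_width if is_red else black_width
        h = width / samples
        area[is_red] += h * sum(density(w + (k + 0.5) * h) for k in range(samples))
        w += width
        is_red = not is_red
    return area[True] / (area[True] + area[False])


print(prob_red(0.1, 0.1))    # small equal stripes
print(prob_red(0.1, 0.2))    # black stripes twice as wide as red ones
print(prob_red(30.0, 30.0))  # equal stripes that are not small
```

Under these assumptions the first call should give a value close to 1/2, the second a value close to 1/3 for "red" (a relation of 2 to 1 in favor of "black"), and the third a value noticeably different from 1/2, because stripes of width 30 are no longer small with respect to the oscillation of the assumed curve.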

This reduction represents an inversion of the method of reduction that 
was attempted in the theory of equiprobable cases. We do not reduce a prob¬ 
ability metric of a nonsymmetrical form to an equiprobability, but, con¬ 
versely, reduce a symmetrical metric, that is, an equiprobability, to a non¬ 
symmetrical probability metric. This way of handling the problem carries 
the advantage that the principle of “no reason to the contrary” is completely 
eliminated. The equiprobability does not appear as following from the 
absence of reasons, but as a result of the existence of definite reasons, namely, 
of the occurrence of the facts formulated in the second assumption. Only 
the question how we arrive at the assumptions 1 and 3 remains to be discussed. 

We are dealing here with assumptions that do not specifically concern 
mechanisms of games of chance, but apply also to physical problems of a 
very different nature—for instance, problems of the theory of errors, since 
every evaluation of measurements of physical values will depend on assumptions
of this type. Even the rough appraisal of degrees of probability, employed
in the assumption 3, is used in many cases that apparently have nothing to do
with probability. We always make use of such appraisals in daily life when
we regard statements about future events as “practically certain”. This 
result shows that the assumption of the equiprobability concerning mech¬ 
anisms of games of chance does not contain any assumptions different from 
those used in probability statements made in everyday life or in science. We 
must regard the mechanisms of games of chance as mechanical devices by 
which certain general properties of physical phenomena are transcribed into 
the special form of equiprobable cases. Mechanisms of this sort express the 
probability character of nature in an instructive and simplified form, but by 
no means do they represent the logical archetype of the problem of probability. 

With this analysis the seemingly self-evident character of the principle of 
“no reason to the contrary” finds its explanation. Cases in which this prin¬ 
ciple is applied always refer to events that, according to the schema of figure 
25 (p. 356), can be reduced to a division by small and equal intervals with 
respect to a probability function. For more complicated mechanisms, for 
example, the die, we shall use a probability function of several variables; 
yet the foregoing mathematical considerations can easily be extended to 
cases of this kind. 2 Our belief in a connection between the symmetry of the
terms of the disjunction and the equiprobability originates from a correct
estimation of the role that is played in such problems by the assumption 2.
The apparently a priori inference of symmetry is actually a legitimate inference
that leads from an a posteriori knowledge, that is, from the assumptions
1 and 3, to an equiprobability by the use of strictly deductive methods as
soon as the geometrical symmetry of the cases is assumed. The inference of
symmetry is thereby cleared of its mysterious character and becomes a
logically justifiable method.

2 See H. Reichenbach, in Zs. f. Physik, Vol. II (1920), p. 163; Vol. IV (1921), p. 448.

This analysis clarifies the question of nonlinear measure transformations
also. When the roulette wheel is surrounded by a square frame and the red
and black sectors are projected on the frame, the areas into which the frame
is divided will not be equal in size. Yet when the spinning hand is replaced
by a spinning light beam, the probability that the light beam comes to rest
is equal for red and black areas. However, if the frame is divided into small
equal areas, alternately colored, the probabilities of “red” and of “black”
will be equal, too, if the areas are sufficiently small. The reason is that the
proof expressed in figure 25 (p. 356) is independent of the shape of the curve;
the equality of the probabilities derives from a limit property of the curve
and is therefore invariant with respect to nonlinear transformations.
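
The invariance can be illustrated by a small computation. This is a sketch under stated assumptions: only one side of the square frame is considered, the position of the light spot is taken to be x = tan(θ) for a beam angle θ between −π/4 and π/4, and `angle_density` is an arbitrarily chosen continuous curve. None of these particulars come from the text; they serve only to make the nonlinear transformation concrete.

```python
import math


def angle_density(theta):
    """Assumed, unnormalized density of the resting angle of the beam."""
    return 1.0 + 0.5 * math.sin(3.0 * theta)


def prob_red_on_frame(n_cells, samples=20):
    """Probability of "red" when one side of the frame, x in [-1, 1], is cut
    into n_cells equal cells colored red, black, red, ...

    The spot position is x = tan(theta), so each cell corresponds to an
    interval of angles whose size varies from cell to cell.
    """
    width = 2.0 / n_cells
    red = total = 0.0
    for i in range(n_cells):
        x_lo = -1.0 + i * width
        t_lo, t_hi = math.atan(x_lo), math.atan(x_lo + width)
        h = (t_hi - t_lo) / samples
        # midpoint rule over the angles that map into this cell
        p = h * sum(angle_density(t_lo + (k + 0.5) * h) for k in range(samples))
        total += p
        if i % 2 == 0:
            red += p
    return red / total


print(prob_red_on_frame(2))     # two large cells
print(prob_red_on_frame(2000))  # many small equal cells
```

With two large cells the result depends on the shape of the assumed curve and is far from 1/2; with many small equal cells it approaches 1/2, although the cells correspond to unequal intervals of the angle.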

These considerations answer the question of the so-called a priori determination
of the degree of probability. The successful applications of the method
can be reduced to probability assumptions of a more general character in
combination with physical assumptions of another sort, concerning geometrical
or mechanical relations. These inferences replace the fallacious inference
in terms of the principle of indifference. What has been regarded erroneously
as an a priori determination of a degree of probability is thus revealed
to be a derivation from other probability assumptions, the validity of which
was established a posteriori, by means of the rule of induction.

§ 70. The Three Forms of an A Posteriori Determination of 
Degrees of Probability 

The analysis presented in the preceding sections can be summed up in the 
following statement: there is no a priori ascertainment of a degree of probability;
a probability metric can be determined only a posteriori.

There are three possible ways for a posteriori establishment of a probability
metric:

1. Degrees of probability can be directly ascertained through induction
by enumeration (statistical probabilities).

2. A probability metric can be inferred deductively from known probabilities
(deduced probabilities).

3. A probability metric can be inferred by means of general inductive
methods from known observational data (hypothetical probabilities).
Method 1 has been pointed out. Method 2 was illustrated in § 69 by the
derivation of a probability metric from the assumption of a continuous prob¬ 
ability function. The primary probabilities can be ascertained by means of 
induction by enumeration. The value of method 2 consists in the fact that it 
can determine a probability rather precisely, even if the original probability 
values obtained by the inductive rule are known only within wide limits of 
exactness. The result is demonstrated by the considerations attached to 
figure 25 (p. 356). 

Method 3 has not yet been mentioned. It is carried through as follows: 
a probability metric is introduced in the sense of a hypothesis; then the 
observational consequences of the assumption are computed and tested, and 
thus the truth of the assumption is judged in terms of its consequences. 
This inductive inference is of the same type as any other inference from 
observational data to a hypothesis. The fact that in this case the hypothesis 
concerns a probability metric makes no difference. 

An illustration is the probability metric on which the theorems of the 
kinetic theory of gases are based. The equiprobability of all arrangements of 
molecules in cells of equal size in the velocity space, and likewise the ergodic 
hypothesis, are hypothetical assumptions from which observational data con¬ 
cerning thermodynamic phenomena are derived and which, conversely, are 
tested through these phenomena. The a posteriori character of the method is
demonstrated by the fact that when, in the realm of low temperatures, observ¬ 
able data were discovered that did not conform to the Maxwell-Boltzmann 
statistics, the assumed metric was replaced by a different assumption, as 
expressed in the Einstein-Bose statistics. These statistics correspond to a 
metric described in § 32 as the metric of nonindividualized combinations. 
Such inferences offer no particular problems. The determination of a prob¬ 
ability metric by these methods is to be incorporated in the general method 
of indirect evidence and to be discussed in the frame of the analysis of this 
method. 

The method of introducing a probability metric hypothetically has found 
an elaborate treatment in the method of statistical inference, developed in 
the last decades by R. A. Fisher, J. Neyman, 1 E. S. Pearson, and A. Wald. 2 
The general problem dealt with in these investigations is to find the most 
plausible probability distribution that comprises a given set of observational 
data. For instance, a physical quantity u has been measured n times, the 
values u_1 ... u_n being the numerical results. What function d(u) represents
the most probable assumption for the probability distribution of these values? 
Problems of this kind cannot be solved without further assumptions. For 

1 “Outline of a Theory of Statistical Estimation Based on the Classical Theory of Probability,”
in Philos. Trans. Roy. Soc. London, Series A, Vol. 236 (1937), p. 333.

2 On the Principles of Statistical Inference, Notre Dame Mathematical Lectures (Notre 
Dame, Indiana, 1942). This booklet includes a report on the literature in this field. 
instance, if it is known that the distribution is a normal curve, its form is
ascertained by the methods described in § 43, that is, by the statistical deter- 
mination of mean and standard deviation. The mathematicians mentioned 
have developed methods for more general cases, which have acquired great 
practical significance. The applicability of every such method depends on 
whether the statistical material satisfies the assumptions presupposed for the 
method. The validity of the assumptions is tested along with the method; 
if the method leads to results that conform to further observation, it is 
accepted in the way any other physical hypothesis is regarded as confirmed 
through observation. 

Of the three methods, the second and third can be carried through only 
when certain probabilities are known; the first method alone is applicable
without such knowledge. The first can, therefore, be used as a primary method
of finding probabilities, whereas the other two are secondary methods. Through
both the latter methods a probability is derived deductively from other
probabilities. In the second method, the derived probability is the inquired
one; in the third method, however, the situation is somewhat involved. The 
hypothesis inquired is statistical, that is, it states a probability, or a probability 
distribution; however, the deductive inference does not supply this prob¬ 
ability directly, but supplies the probability that the hypothesis is correct 
and thus the probability that the inquired probability holds. The transition 
to a determinate probability distribution, or statistical hypothesis, is then 
made through selection of the distribution for which the maximum probability 
obtains. For this reason, hypothetically introduced probabilities cannot be 
derived with certainty from given probabilities, and their use will always 
include the uncertainty of a hypothesis of which we know only that it is the 
most probable one. Usually the inference is so devised that it omits the deter¬ 
mination of the probability of the second level and supplies directly the 
distribution that possesses the maximum probability. 

With respect to the third method, the necessity of a previous knowledge 
of some probabilities has not always been recognized, and attempts have 
been made to construe this method as a primary means of finding probabilities. 
These attempts break down for two reasons. First, the method presupposes 
a great deal of inductive knowledge for the selection of the general form of 
the assumed hypothesis, the choice of which depends on previous experiences 
in the same field. Second, since the method is meant to supply a probability 
for a hypothesis, it is dependent on the inference by indirect evidence, 
which cannot be applied unless certain probabilities are known. This fact is 
obvious when it is realized that the inference by indirect evidence is a form 
of the rule of Bayes and therefore determines the probability of a hypothesis— 
an inverse probability—as a function of certain forward probabilities (§ 21). 
Among these, the probabilities referring to the observational data, the probabilities
P(A.B_i, C_k) of (10, § 21), can often be deductively derived; this
will be the case if the B_i represent statistical hypotheses. The antecedent
probabilities P(A, B_i), however, must be determined empirically. Often this
determination can once more be achieved by means of an inference by indirect 
evidence; but it is obvious that ultimately inferences by indirect evidence 
can be made only after some probabilities have been ascertained by the first 
method, statistical enumeration. 

With the intention of evading this consequence, the mentioned attempts at 
construing the third method as a primary method use inferences that dispense 
with a statistical determination of antecedent probabilities. The aprioristic 
school of probability makes use of the principle of indifference for this purpose; 
unknown antecedent probabilities are regarded as being equal to one another, 
and Bayes’ rule assumes the simple form (12, § 21), which contains no ante¬ 
cedent probabilities. The preceding discussion of the principle of indifference 
(§ 68) makes it clear that such a procedure is not permissible. Empiricist-
minded thinkers, however, have tried to construct an inference by indirect
evidence that does not presuppose antecedent probabilities. Of this kind is 
the inference by confirmation, the fallacious character of which was explained
at the end of § 21. Another attempt makes use of a principle of maximum
likelihood, which was introduced by R. A. Fisher 3 with the intention of eliminating
the necessity of knowing antecedent probabilities for the establishment
of statistical hypotheses. This principle must be given a brief discussion, for 
which a simple application of the principle may be chosen. 

Assume a set of n observations of a random variable u has been made; 
what is looked for is the probability distribution d(u) that controls the occurrence
of these values. This function is regarded as known apart from a
parameter s, so that it can be written in the form d(s;u); an extension of the
method to more than one unknown parameter is easily given and need not
be studied here. The problem is to find the value of s that is made most
probable by the n given data. The range of observed values may be divided
into r intervals du_1 ... du_r; the number of observed values within an interval
du_i may be n_i. The forward probability w that, if d(s;u) is the distribution,
the observed set of values will result, can be computed by the extended formula
of Newton (see footnote, p. 264), written for probabilities of the form d(s;u)du:

$$w = L_n(s; u_1, n_1, \ldots, u_r, n_r)\, du_1^{n_1} \cdots du_r^{n_r}$$

$$L_n(s; u_1, n_1, \ldots, u_r, n_r) = \frac{n!}{n_1! \cdots n_r!}\, d(s; u_1)^{n_1} \cdots d(s; u_r)^{n_r} \qquad (1)$$

$$n = n_1 + \ldots + n_r$$

3 “On the Mathematical Foundations of Theoretical Statistics,” in Philos. Trans. Roy. Soc. 
London , Series A, Vol. 222 (1922), p. 309; “Two New Properties of Mathematical Likeli¬ 
hood,” in Proc. Roy. Soc. London , Series A, Vol. 144 (1934), p. 285. 
The function L_n is called the likelihood function; the principle of maximum
likelihood states that the actual value of s is the one for which the function
L_n is a maximum (with respect to variations of s). The function d(s;u) can
thus be determined without the use of antecedent probabilities. The maximum
of the forward probability replaces the unknown maximum of the inverse
probability.
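
A simple computation may illustrate the principle. It is a sketch only: the form chosen for d(s;u), a normal curve with unknown mean s and unit standard deviation, and the interval midpoints and counts are assumptions made for the example, not data from the text. Factors that do not depend on s are omitted, since they do not move the maximum, and the logarithm is used because it has its maximum at the same value of s.

```python
import math


def d(s, u, sigma=1.0):
    """Assumed form of d(s;u): normal density with unknown mean s, known sigma."""
    return math.exp(-0.5 * ((u - s) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def log_likelihood(s, centers, counts):
    """Logarithm of d(s;u_1)^n_1 ... d(s;u_r)^n_r, dropping s-independent factors."""
    return sum(n_i * math.log(d(s, u_i)) for u_i, n_i in zip(centers, counts))


def maximum_likelihood(centers, counts, grid):
    """Value of s on the grid for which the likelihood is greatest."""
    return max(grid, key=lambda s: log_likelihood(s, centers, counts))


# Hypothetical observations: interval midpoints u_i and counts n_i, n = 30.
centers = [-1.0, 0.0, 1.0, 2.0, 3.0]
counts = [2, 7, 12, 6, 3]
grid = [k / 100.0 for k in range(-200, 401)]

s_hat = maximum_likelihood(centers, counts, grid)
print(s_hat)
```

For the assumed normal form the maximum lies at the weighted mean of the midpoints, so the grid search should return the grid value nearest to 31/30. A grid search is used here only for transparency; any maximization routine would serve.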

Like all inferences by indirect evidence, this inference must be construed 
as an application of the rule of Bayes, and in a complete formulation it will 
include a reference to antecedent probabilities. There must exist an antecedent 
probability density q(s) for the values of s; the inverse probability density v_n
for a distribution d(s;u) is then found through a generalization of (5, § 62),
the function w_n of this formula [apart from a constant factor, which drops
out in (2)] being replaced by the function L_n:

$$v_n(u_1, n_1, \ldots, u_r, n_r; s) = \frac{q(s)\, L_n(s; u_1, n_1, \ldots, u_r, n_r)}{\int_{-\infty}^{+\infty} q(s)\, L_n(s; u_1, n_1, \ldots, u_r, n_r)\, ds} \qquad (2)$$

The product du_1^{n_1} ... du_r^{n_r} drops out because it occurs in both numerator and
denominator. The distribution of the highest probability is the one for which
the function v_n(s) has a maximum (the indication of the constants u_1 ... n_r
may be omitted). Obviously, the maximum of v_n(s) will in general not coincide
with the maximum of L_n(s). The use of the principle of maximum likelihood is
therefore restricted to certain conditions. What are these conditions?

A sufficient condition for the coincidence of the maxima of v_n(s) and L_n(s)
is the assumption q(s) = const., that is, the equality of the antecedent probabilities.
If the method were dependent on this assumption, the principle of
maximum likelihood would be an equivalent of the principle of indifference
and subject to the same criticism. 4 But this criticism would be too strict.
Only if for all values of s the likelihood were regarded as proportional to the
inverse probability would the method presuppose the constancy of q(s); but
the maxima of L_n(s) and v_n(s) can coincide within a small interval ds even if q(s)
is not constant. This is obvious when q(s) has its maximum for the same value s
as L_n(s); but even if this is not the case, the shift of the maximum of v_n(s)
can be small enough to leave the maximum within the interval ds. It follows,
however, that the principle can be used for the determination of the most
probable distribution only when something is known about the antecedent
probability q(s). Such knowledge is often available; antecedent probabilities
are not probabilities of a specific kind, but are accessible to direct statistical
determination like all other probabilities. 5 And the situation can be of such

4 This view is held by M. G. Kendall, “On the Method of Maximum Likelihood,” in Jour.
Roy. Stat. Soc., Vol. 103 (1940), p. 388.

5 For an illustration see J. Neyman, “Basic Ideas and Some Recent Results of the Theory
of Testing Statistical Hypotheses,” in Jour. Roy. Stat. Soc., Vol. 105, Part 4 (1942), p. 298. 
a kind that a rather inexact knowledge of q(s) is sufficient to justify the use
of the principle of maximum likelihood. This is a general characteristic of 
all inferences supplying inverse probabilities: the power of the inference by 
indirect evidence derives from the fact that, if the antecedent probabilities 
are known within wide limits, the inverse probabilities are determinable 
within narrow limits. The considerations attached to (12, § 62) show how
such results can be achieved. For this reason, the inference by indirect evi¬ 
dence can be used for the establishment of precise quantitative results even 
when knowledge of the antecedent probabilities is of a merely qualitative 
nature. 
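
The following sketch indicates how such a result can come about. All particulars are assumptions made for illustration: d(s;u) is taken as a normal curve with unknown mean s and unit standard deviation, the antecedent density q(s) as a normal curve with a chosen center and spread, and the ten observed values are hypothetical. The denominator of the inverse probability does not depend on s and is therefore omitted in the search for the maximum.

```python
def log_L(s, data):
    """Logarithm of the likelihood for the assumed normal form, up to a constant."""
    return sum(-0.5 * (u - s) ** 2 for u in data)


def log_q(s, center, spread):
    """Logarithm of an assumed antecedent density q(s), up to a constant."""
    return -0.5 * ((s - center) / spread) ** 2


def likelihood_maximum(data, grid):
    """Value of s on the grid that maximizes the likelihood alone."""
    return max(grid, key=lambda s: log_L(s, data))


def inverse_maximum(data, center, spread, grid):
    """Value of s on the grid that maximizes q(s) times the likelihood."""
    return max(grid, key=lambda s: log_L(s, data) + log_q(s, center, spread))


# Hypothetical observations with arithmetic mean 1.1.
data = [0.8, 1.3, 0.9, 1.6, 1.1, 0.7, 1.4, 1.2, 1.0, 1.0]
grid = [k / 1000.0 for k in range(-1000, 3001)]

ml = likelihood_maximum(data, grid)
broad_a = inverse_maximum(data, center=0.0, spread=5.0, grid=grid)
broad_b = inverse_maximum(data, center=2.0, spread=5.0, grid=grid)
narrow = inverse_maximum(data, center=0.0, spread=0.2, grid=grid)
print(ml, broad_a, broad_b, narrow)
```

Under these assumptions the two broad antecedent densities, although centered at quite different values, should leave the maximum within about one hundredth of the maximum of the likelihood, whereas the narrow antecedent density shifts it considerably. The example thus shows both sides of the argument: a merely qualitative knowledge of q(s) can suffice, but some knowledge of q(s) is needed to exclude the second kind of case.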

As long as the principle of maximum likelihood is regarded as a principle 
determining one hypothesis as preferable to others, in the sense of being more 
likely than others, it is dependent on estimates of antecedent probabilities, 
like all other forms of the inference by indirect evidence. A way of overcoming 
this limitation will be discussed in § 88; however, it involves a renunciation 
of attempts to find the most probable hypothesis and reduces the principle 
to an improved version of the first method. But even if the principle can be 
freed from the use of antecedent probabilities, it includes other presuppositions
that can be proved by inductive methods only: it presupposes a knowledge
of the form of the function d(s;u) and, furthermore, a knowledge that
the observed values of u are independent of one another and admit of the
use of the special theorem of multiplication for the computation of the function
L_n. Corresponding qualifications hold for all similar principles employed
in the theory of statistical inference, including Neyman’s methods, 6 which 
avoid the principle of maximum likelihood. 

I shall use the term advanced knowledge to denote a state of knowledge 
that includes a sufficient number of probabilities; it then follows that the 
second and third methods of ascertaining probabilities are applicable only 
in advanced knowledge. For primitive knowledge —a state of knowledge that 
does not include a knowledge of probabilities—the rule of induction is the 
only instrument for the ascertainment of probabilities. Incidentally, the rule 
of induction can also be applied in advanced knowledge and then assumes a 
somewhat different function, which will be studied in § 86; but for primitive 
knowledge it is the only available instrument. Many controversies about the 
legitimacy of probability methods arise from a confusion of these two states 
of knowledge: in advanced knowledge the whole technique of the calculus 
of probability is at our disposal and can be adduced to justify the methods 
employed, whereas primitive knowledge requires other means of justifying 
inductive inferences. 

The given analysis shows that only for statistical induction must a justifi¬ 
cation be required. The other two methods of ascertaining probabilities are 

6 This is also Neyman’s opinion; Jour. Roy. Stat. Soc ., Vol. 105, Part 4 (1942), p. 294. 
covered by the logical system of the calculus of probability. The axiomatic
construction of this calculus is proof that the rule of induction is the only
nonanalytic principle necessary for the application of the calculus to reality,
if the frequency interpretation of probability is assumed. The rule does not
enter into the formal calculus, which is a deductive system like all other 
mathematical disciplines; however, it is required for the applied calculus of 
probability, which deals with empirical statistics. The assertability of applied 
probability statements is thus reduced to the problem of a justification of 
the rule of induction. 

The discussion of the problem must again be postponed, since it cannot 
be given without the construction of specific logical tools. The result may be 
emphasized, however, that the justification of induction is the only problem 
remaining for a theory of probability when the frequency interpretation is 
assumed. 

At this point I must discuss an opinion expressed by some mathematicians 
who deny the existence of a problem of application and thus wish to evade 
the difficulties of the problem of induction. 7 According to this opinion, the 
problem of application in the calculus of probability does not differ from 
that existing for other mathematical disciplines, in particular, geometry. 
Mathematically speaking, geometry is a deductive system based on certain 
axioms; if it is to hold for the physical world, the interpretation of the axioms 
must be so chosen that the axioms are true. Thus light rays may be chosen 
as an interpretation of straight lines, solid rods as defining spatial congruence, 
and so on, when it is proved that these physical objects satisfy the relations 
postulated in the axioms. The problem of the application of geometry is thus 
solved by the method of adequate selection of objects. 

It was mentioned above that similar considerations hold for the application 
of the axioms of probability; and the result was expressed in the requirement 
that any interpretation of the P-symbol must be an admissible interpretation. 
The analysis shows, however, that for the interpretation of the calculus of 
probability there arises a specific problem that has no analogue in the inter¬ 
pretation of other mathematical disciplines: the selection of an interpretation 
before we know whether the interpretation is an admissible one. In this form 
we can state the problem of induction. If the sequence under consideration 
has a limit of the frequency at p, it certainly constitutes an admissible interpretation
of the statement P(A,B) = p. What creates the difficulty, however,
is the fact that we must use the sequence as an interpretation before we know 
whether it has a limit of the frequency at p, even before we know whether 
it has a frequency limit at all. This specific difficulty makes the problem of the 
application of probability statements unique. 

7 This opinion was stated by R. von Mises, Vorlesungen aus dem Gebiete der angewandten 
Mathematik, Vol. I: Wahrscheinlichkeitsrechnung (Leipzig, 1931), p. 21; and by E. Tornier, 
“Grundlagen der Wahrscheinlichkeitsrechnung,” in Acta Math ., Vol. 60 (1933), p. 313. 
It is true that difficulties of this kind arise in the application of other
mathematical disciplines; thus the verification of the axioms of geometry 
involves measurements that supply only approximative results, and the 
assumption is made that on repetition the measurements will converge 
toward a certain limit. But it is clear that this assumption is not of a geo¬ 
metrical nature; it rather represents an inductive inference and thus falls in 
the domain of investigations concerning probability sequences. The example 
illustrates the fact that, when any mathematical discipline is applied to
physical reality, probability inferences intervene; and that therefore all appli¬ 
cation problems, in addition to the method of adequate selection of objects, 
include methods used in the application of the calculus of probability. The 
solution of the problem of induction is thus shown to be the necessary pre¬ 
requisite for a solution of any problem of application. 

§ 71. Attempts at a Single-Case Interpretation 
of Probability 

After the discussion of the frequency meaning of probability, the investigation 
must turn to linguistic forms in which the concept of probability refers to an 
individual event. It is on this ground that the frequency interpretation has 
been questioned. Some logicians have argued that such usage is based on a 
different concept of probability, which is not reducible to frequencies. Is the
existence of two disparate concepts of probability an inescapable consequence 
of the usage of language? 

At first sight, indeed, it would seem that a probability applied to a single 
case has nothing to do with a frequency. We say, “It is probable that it will 
rain tomorrow”; “It is improbable that Julius Caesar was in Britain”; and 
so on, thus referring probability to a single event. How does it help to know 
that on a certain percentage of days of certain meteorological conditions it 
will rain, when we wish to know the probability for rain on one particular day? 
Similarly, the example of Julius Caesar’s stay in Britain has often been quoted
as an instance of a probability that makes a frequency interpretation impos¬ 
sible. We must analyze the various meanings that can be suggested for such 
a second concept of probability. 

It was explained above that the formal system of the probability calculus 
does not distinguish one interpretation from others as the only admissible 
one; formally speaking, therefore, the possibility of the occurrence of several 
interpretations in practical applications cannot be denied. The analysis of 
the problem must be carried through from other viewpoints. 

What must be asked is whether an interpretation is adequate to account 
for the use of probabilities as degrees of reliability or as instruments suitable 
for an evaluation of predictions. For it is the predictional value that makes
probability statements indispensable tools of conversational as well as of
scientific language. We know that a specific event either will happen or will 
not. The degree of probability, therefore, will be of no use after the truth 
about the occurrence of the event is known: probability is used as a substitute 
for truth so long as the truth is unknown. If the event is to happen in the 
future, the degree of its probability qualifies the reliability of a prediction; 
and if the event belongs to the past we can also regard the probability as the 
measure of the quality of a prediction, namely, of a prediction about possible 
future verifications of a past event the truth value of which is unknown. 
The criterion for the justification of an interpretation lies in its adequacy for 
the purpose of prediction. 

This criterion immediately rules out an interpretation like the geometrical 
interpretation introduced in § 40. That probability can be used as a measure 
of areas is a logically interesting fact, but no one will be inclined to use the 
term “probability” instead of “area” for surveying or similar purposes. Only 
with respect to continuous probability sequences (§ 46) is the measure interpretation
of probability important for applications; but the interpretation
is then an analogue of the frequency interpretation, replacing a counting of 
elements by the measuring of a length, and its discussion thus belongs in the 
frame of the analysis of the frequency interpretation. 

The first interpretation of the probability of single events is the degree of 
expectation with which an event is anticipated. The feeling of expectation 
certainly represents a psychological factor the existence of which is indispu¬ 
table; it even shows degrees of intensity corresponding to the degrees of prob¬ 
ability. Difficulty, however, arises from the fact that the degree of expectation 
varies from person to person and depends on more factors than the degree of 
the probability of the event to which the expectation refers. Apart from the 
probability of an event, emotional associations will influence the feeling of 
expectation. If it is a desirable event, as, for instance, the passing of an 
examination, optimistic persons will anticipate it with too-certain expecta¬ 
tions, whereas pessimistic persons will think of it in terms of too-uncertain 
expectations. 

The contrary holds for undesirable events. The discrepancy between expec¬ 
tation and degree of probability is expressed in the phenomenon of fear. In a 
thunderstorm there is a certain low probability that we might be struck by 
lightning; if it strikes our immediate surroundings we have no means of 
protecting ourselves against the violence of the electrical discharge, which 
does not keep to prescribed paths. But we need not be afraid, as the prob¬ 
ability of this event is extremely low—lower, for instance, than the prob¬ 
ability of a traffic accident. If, in spite of this fact, many people are afraid of 
thunderstorms, their behavior demonstrates that their feeling of expectation 
is much higher than would correspond to the probability of being struck by 
lightning. The emotional tension caused by the roaring thunderbolts may
make the feeling of expectation grow far too intense. In fact, fear may be 
defined as the exaggeration of the feeling of expectation concerning the 
probability of an undesirable event. Of course, it would be a mistake in the 
other direction to exclude the possibility of being struck by lightning; we
should try to reduce the feeling of expectation to the proper degree. But 
few persons succeed in the attempt. 

Corresponding considerations hold for the opposite of fear. Hope is the 
exaggeration of the feeling of expectation concerning the probability of a 
desirable event. The alluring call of hope often persuades us to reckon seriously 
with probabilities that would be discounted after sober analysis. The delusion 
concerns not only events influencing the life of an individual, but anticipations 
of social conditions, of the progress of civilization, and of a better world in 
future. We are only too willing to believe that what we desire will come true. 
This expectational illusion on emotive grounds plays a great role in political 
ideology. 

These considerations show why the feeling of expectation does not represent 
the meaning of the probability concept of applications. We rather conceive 
probabilities as numerical values holding for the physical world to which our 
feeling of expectation must be adjusted, as measures of what the intensity 
of our expectation should be, but not of what it is. We try to regulate the 
feeling of expectation to the correct degree by means of training and adapta¬ 
tion. Much of the wisdom of the so-called experienced person consists in being 
well versed in this art. But if there is an adjustment of expectation to the 
correct degree, there must exist an interpretation of probability independent 
of the feeling of expectation. 1 

The second interpretation of the probability of the single case is connected 
with the principle of indifference. It was mentioned in § 68 that the principle 
has played a role in the history of the theory of probability. In order to 
maintain the principle of indifference against criticisms of the kind presented 
in § 68, an interpretation of probability was developed that defines the 
meaning of probability in such a way that the principle seems to be legitimate. 

The interpretation is based on the logical principle of retrogression, which 
is involved in the theory of meaning. It states that the meaning of a sentence 
is given by the method of its verification. 2 Since, according to the adherents 

1 Apart from this consideration, there are other reasons why the feeling of expectation 
cannot be carried through as an adequate interpretation of the meaning of the axiomatic 
system. Even if we admit that within certain limits the intensity of expectation corresponds 
to the degree of probability, it can by no means be asserted that the correspondence still 
holds in dealing with more complicated forms of probability relations. Thus we cannot say 
that axioms of the calculus of probability represent laws of the feeling of expectation, for 
example, that the theorem of multiplication may be applicable to the intensity of the feeling 
of expectation. Psychological laws are much too complicated to fit the schema of mathe¬ 
matical probabilities. 

2 See EP, p. 49. 
of the principle of indifference, we determine a probability by counting the 
terms of an exclusive disjunction in which we have no reason to prefer one 
term, the principle of retrogression supplies the result that the meaning of 
probability is given by reference to such a disjunction. We thus arrive at a 
retrogressive interpretation of probability. 

For instance, that the probability of obtaining face 6 with a die is 1/6 means, 
according to this interpretation, that face 6 is a term of a disjunction of six 
terms and that we have no reason to prefer one of the terms. 3 It is obvious 
that this interpretation justifies the use of the principle of indifference, since 
then the probability statement states no more than what is assumed as the 
premise of that principle. Thus we have here a nonfrequency interpretation 
of probability that is compatible with the principle of indifference (see p. 354). 

It is equally obvious, however, that in this interpretation the probability 
statement loses its predictional value. Why should we bet on the occurrence 
of the event “non-6” rather than on the occurrence of “6”? The retrogressive 
interpretation narrows the meaning of the probability statement in such a 
way that the assertion of the statement is justified on the basis of the prin¬ 
ciple of indifference; but in the transition from probability statements to 
predictions, or advices to action, there reappears the very problem that the 
retrogressive interpretation was intended to evade and that the principle of 
indifference cannot solve. 

The third interpretation of the probability of the single case was constructed 
with the intention of escaping the difficulties of both the frequency 
and the retrogressive interpretation. Its proponents insist that we must 
renounce defining probability in terms of other concepts. Probability, it is 
argued, is not reducible to certainty, be it the certainty of frequency state¬ 
ments or that of statements about terms of a disjunction. Instead, probability 
is regarded as a primitive concept that is not capable of further definition. 
According to this conception, the statement that the probability of an expected 
event is l has a meaning of its own, which is comparable to the meaning of 
primitive notions of logic; and this meaning cannot be interpreted as a frequency 
or a report about terms in a disjunction. The conception is sometimes 
stated in the form that probability is a rational belief, that the laws of probability 
constitute a quantitative logic based on a self-evidence comparable to 
that of ordinary logic. 4 However, adherents of the primitive-concept interpretation 
do not always clearly distinguish it from the retrogressive interpretation; 
some logicians vacillate between the two interpretations, depending 
upon what they wish to prove. 

3 So far as I know, the first to use this interpretation was C. Stumpff, in an article published 
in Ber. d. bayer. Akad., philos. Kl., 1892. The interpretation has been taken up in our 
day by some logicians under the influence of ideas of Ludwig Wittgenstein, Tractatus 
logico-philosophicus (London, 1922), p. 113; thus by A. Waisman, in Erkenntnis, Vol. I (1930), 
p. 229. 

4 This concept is represented by the ideas of J. M. Keynes, A Treatise on Probability 
(London, 1921), and was continued by Harold Jeffreys, Theory of Probability (Oxford, 1939). 
I cannot say to what extent it is also present in the ideas of Carnap and others about 
confirmation (see § 88). 

The difficulties of the primitive-concept interpretation appear so over¬ 
whelming that it is hard to understand how a logician can commit himself 
to it. First, the degree of probability remains unverifiable in terms of the 
event predicted. When the event expected with a probability | is observed, 
does this observation verify the probability statement? Obviously not, since 
the nonoccurrence of the event, too, is compatible with the probability state¬ 
ment. And how could one distinguish between a probability, say, of f, and 
one of f ? The numerical value of a probability cannot be ascertained from one 
observation. We do not escape the predicament by restricting probability 
statements to relations of order stating merely that a probability is higher 
or lower than another (see p. 380); such relations cannot be verified by one 
observation either. 

It is sometimes argued that verification of the degree of probability is 
obtained, not by observation of the event, but by other methods, such as 
are used in the principle of indifference. Thus it is said that the probability 1/6 
for the face of the die is verified by the existence of the six faces of the die. 
But with this argument all the difficulties of the principle of indifference 
rise anew; it seems incomprehensible that absence of reason to the contrary 
should be a guaranty of equal probabilities. To use the principle of indifference 
for verification of statements of probability degrees is not permissible when 
probability has a meaning of its own; it is permissible only for a retrogressive 
interpretation. In resorting to the principle of indifference for verification of 
probability statements, adherents of the primitive-concept interpretation 
adopt a method of defense that is suitable only for a retrogressive interpre¬ 
tation. They forget that with such a defense the primitive-concept interpre¬ 
tation is abandoned and the probability statement loses its predictional value. 

Second—and this point is closely connected with the first—even if the 
probability statement is regarded as having a meaning of its own, not subject 
to a retrogressive interpretation, there is no way of explaining its usefulness 
for predictions. We are dealing here with a question of meaning, not of 
assertability. Even if it were possible to justify the assertion of a probability 
statement in this interpretation, there remains the question of whether its 
meaning is of such a kind that the statement can be used as a guide for pre¬ 
dictions. Assume, for a moment, that the truth of the probability statement 
has been established. Why, then, is it advisable to predict the event of the 
greater probability? If the probability statement cannot be verified through 
the occurrence or nonoccurrence of future events, it does not state anything 
concerning future events and thus cannot be used as an advice for predictions. 
In other words, what difference does it make if a man, instead of assuming 
the event B to happen, prefers to assume that the event B̄ will happen? 
Suppose we could prove that the probability of B is = f. What kind of later 
experience would demonstrate to a man who had assumed B̄ to happen that 
he had acted unwisely? I do not think that the man would recognize a proof 
that appealed to rational belief and attempted to evoke his faith in the rules 
of probability. The only convincing argument would be an experience he 
might undergo with respect to the occurrence or nonoccurrence of the event. 
But if some experience proved to him that he was wrong, it would be a verifi¬ 
cation of the probability statement, the meaning of which would thus no 
longer be primitive, but would be translated into observable properties of 
future events. 

To state the argument briefly: if probability has something to do with the 
reliability of predictions, the probability statement must be verifiable in 
terms of the occurrence of the event predicted; otherwise the statement will 
be empty so far as predictions are concerned. The frequency interpretation 
satisfies this condition inasmuch as it verifies a degree of probability through 
repeated occurrence of the event. If, however, the meaning of the probability 
statement refers to a single event, it is impossible to verify the statement in 
terms of the occurrence of the event; and therefore the statement has no 
predictional value. 

Adherents of the primitive-concept interpretation usually maintain that 
the rules of probability constitute a part of logic, and that if human behavior 
is to be regarded as reasonable it must follow these laws as well as it follows 
the laws of deductive logic. This conception must be analyzed more closely. 
First, the laws of probability will then include a rule for the ascertainment 
of probability, which is usually constructed as a version of the principle of 
indifference; second, however, they will include the axioms of the calculus 
of probability, since these axioms cannot be regarded as analytic if the fre¬ 
quency interpretation is rejected. We thus are faced by a comprehensive 
system of synthetic statements that are asserted to be a part of logic. 

The primitive-concept interpretation, therefore, leads to a conception of 
logic for which logic is not analytic throughout and therefore is not empty. 
Logic is conceived, rather, as a science revealing the ultimate laws of the 
physical world, not by means of sense perception but by a rational insight 
into the nature of the universe. Such a conception of logic has found its 
classical expression in Kant’s philosophy of the synthetic a priori, which 
asserts that there exists a knowledge that is synthetic and yet is independent 
of experience and strictly certain. Knowledge of this kind is characterized 
by synthetic self-evidence, analogous to the self-evidence of analytic statements. 
In fact, the rational belief that is alleged to be the basis of the ascertainment 
of a probability, or of the laws of probability, represents a revival of the 
rationalism of Leibniz and Kant, of the vérités de raison as well as of the 
synthetic a priori. 

This conception of logic cannot be discussed in detail here. Suffice it to 
say that the claims of a synthetic a priori cannot be recognized by an empiricist 
philosophy; that the development of science since the time of Kant has proved 
his philosophy of the synthetic a priori to be untenable; that such a philosophy 
must be classified as a remnant of the speculative metaphysics of earlier 
phases of civilization and is incompatible with modern scientific method. 
In fact, the admission of synthetic self-evidence would spell the breakdown 
of scientific philosophy. Trust in synthetic self-evidence means belief that 
nature must conform to reason; but the history of knowledge has shown 
that what we call “reason” is the product of a physical environment of rather 
simple structure and that with the widening scope of empirical knowledge 
the so-called laws of reason have changed. There is no such thing as synthetic 
self-evidence; the only admissible sources of knowledge are sense perception 
and the analytic self-evidence of tautologies. 

With this result we must abandon all attempts at finding a satisfactory 
interpretation of probability statements that restricts their meaning to a 
single case. That the frequency interpretation can supply an empiricist solution 
of the probability of the single case will be shown in § 72. 

§ 72. The Frequency Interpretation of the Probability 
of the Single Case 

The analysis of meaning has suffered from too close an attachment to psychological 
considerations. The meaning of a sentence has been identified with the 
mental images associated with the utterance of the sentence. Such a conception 
leads to meanings varying from person to person; and it will not help to find 
the meaning that a man would adopt if he had a clear insight into the implications 
of his words. Logic is interested not in what a man means but in what 
he should mean, that is, in the meaning that, if assumed for his words, would 
make his words compatible with his actions. 

When the meaning of probability statements about single events is analyzed 
according to this objective criterion, it is found that the frequency interpretation 
can be applied to this case, too. True, we must renounce a reconstruction 
of subjective psychological intentions; but, since we found that it is not 
possible to translate the expectation associated with the anticipation of a 
future event into a logical category, we shall welcome the construction of 
a logical substitute that can take over the function of a probability of the 
single case without being such a thing in the verbal sense. 

Assume that the frequency of an event B in a sequence is = f. Confronted 
by the question whether an individual event B will happen, we prefer to 
answer in the affirmative because, if we do so repeatedly, we shall be right 
in $ of the cases. Obviously, we cannot maintain that the assertion about 
the individual event is true. In what sense, then, can the assertion be made? 
The answer is given by regarding the statement not as an assertion but 
as a posit. We do not say that B will occur, but we posit B. We do so if 
P(B) > 1/2; otherwise we posit B̄. The word “posit” is used here in the same 
sense as the word “wager” or “bet” in games of chance. When we bet on a 
horse we do not want to say by such a wager that it is true that the horse 
will win; but we behave as though it were true by staking money on it. 
A posit is a statement with which we deal as true, although the truth value is 
unknown. We would not do so, of course, without some reason; we decide 
for a posit, or a wager, when it seems to be the best we can make. The term 
“best” occurring here has a meaning that can be numerically interpreted; 
it refers to the posit that will be the most successful when applied repeatedly. 
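
The numerical sense of “best” can be illustrated by a minimal sketch. The function name `best_posit` and the use of exact fractions are assumptions made for illustration only; the rule itself (posit B when P(B) > 1/2, otherwise posit non-B, and judge the posit by its success ratio under repetition) is the one stated above.

```python
from fractions import Fraction


def best_posit(p_b):
    """Return the event to posit and the success ratio obtained on repetition.

    p_b is the probability (limit of the frequency) of B in the sequence.
    """
    p_b = Fraction(p_b)
    if p_b > Fraction(1, 2):
        # Positing B is right in a fraction p_b of the cases.
        return "B", p_b
    # Otherwise non-B is posited; it is right in a fraction 1 - p_b of the cases.
    return "non-B", 1 - p_b


choice, ratio = best_posit(Fraction(2, 3))
```

With the hypothetical value 2/3 the sketch selects B with a success ratio of 2/3; with 1/4 it would select non-B with a success ratio of 3/4.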

If we wish to improve a posit, we must make a selection S such that 
P(A.S,B) > P(A,B).¹ If we now posit B only in the case A.S and omit 
a posit in the case A.S̄, we obtain a relatively greater number of successes 
than by the original posit. It is even more favorable to construct the selection 
so that at the same time P(A.S̄,B) is < 1/2. We then always posit B̄ in the 
case A.S̄. We can thus make a posit for each element of the original sequence 
and obtain a greater number of successes. The procedure may be called the 
method of the double posit. 
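
The gain from the double posit can be checked with simple arithmetic. The following sketch is only an illustration: the function names and the numerical values are hypothetical, and `weight_s` stands for the probability P(A,S) that an element of the main sequence belongs to the selection. The main-sequence probability is recovered from the two subsequences by the rule of elimination.

```python
from fractions import Fraction


def success_ratio(p):
    """Success ratio of the better of the two posits (B or non-B) at probability p."""
    return max(p, 1 - p)


def main_sequence_probability(weight_s, p_in_s, p_outside_s):
    """Rule of elimination: P(A,B) from P(A,S), P(A.S,B) and P(A.non-S,B)."""
    return weight_s * p_in_s + (1 - weight_s) * p_outside_s


def double_posit_ratio(weight_s, p_in_s, p_outside_s):
    """Posit separately in A.S and in A.non-S, taking the better posit in each."""
    return (weight_s * success_ratio(p_in_s)
            + (1 - weight_s) * success_ratio(p_outside_s))


# Hypothetical values: S selects half of the sequence; B is probable inside S
# and improbable outside it.
weight_s = Fraction(1, 2)
p_in_s = Fraction(9, 10)
p_outside_s = Fraction(3, 10)

p_main = main_sequence_probability(weight_s, p_in_s, p_outside_s)
single = success_ratio(p_main)
double = double_posit_ratio(weight_s, p_in_s, p_outside_s)
```

Under these assumed values the single posit on the main sequence succeeds in 3/5 of the cases, the double posit in 4/5. If the selection leaves the probability unchanged in both subsequences, the two ratios coincide and nothing is gained.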

It should be noticed that we cannot improve a posit without knowing a 
selection S that leads to a greater probability. If we were to posit arbitrarily 
sometimes B, sometimes B̄, we would in general construct a selection S for 
which P(A.S,B) = P(A,B) = P(A.S̄,B), that is, a selection of the domain 
of invariance. Positing B in the case A.S would then lead to the same relative 
number of successes as in the main sequence, whereas positing B̄ in the 
case A.S̄ would lead to a smaller ratio of successes. 

In dealing with exclusive disjunctions $B_1 \lor \dots \lor B_r$ of more than two 
terms, if we are compelled to posit only one of the terms (if a posit $B_2 \lor \dots \lor B_r$ 
is impossible because of practical reasons), the $B_k$ that carries the greatest 
probability will be the best posit. In this case, therefore, the probability 1/2 
no longer represents a critical value. 

The method of positing serves to utilize probability statements for decisions 
in regard to single cases. It plays an important role in all practical applica¬ 
tions. The merchant who stores a great amount of merchandise for the season, 
the farmer who wants to get in his crop, the physician who prescribes a cure— 
all must make decisions, though they know only probability statements about 
the factors determining success: the merchant about the prospective demand, 
the farmer about the prospective weather, the physician about the illness 
that presumably confronts him. They make posits by assuming the occurrence 

1 If we have P(A.S,B) < P(A,B), the selection S̄ will have the desired property because 
it then follows from the rule of elimination that P(A.S̄,B) > P(A,B). See the discussion 
following (11b, § 19). 
of the events that they consider to be the most probable, according to their 
experience. Each endeavors to improve his posit by increasing the probability 
through a more precise analysis of the actual conditions, that is, by making 
a selection S such that a greater probability will hold for the subsequence 
determined by S, and such that, if possible, even the method of the double 
posit becomes applicable. The merchant will explore the market situation 
more thoroughly, the farmer will study the official weather forecast, the 
physician may try to analyze the condition of his patient more exactly by 
taking X-rays. There is no instance in which certainty is reached; only a very 
high probability is attainable. 

It is not necessary for the construction of a sequence that similar cases 
repeat themselves. If we must make a decision today about the prospective 
weather, tomorrow about the state of an illness, the day after tomorrow 
about some financial transaction, and if we always posit the most probable 
case, our decisions represent a sequence in which the probability changes 
from element to element, that is, a sequence that belongs to the type of the 
Poisson sequence (§ 56). Here only one horizontal sequence of the Poisson 
lattice is realized. But this sequence suffices to obtain a statistical success. 
Since probabilities usually have values around 1 or 0, we shall be able to 
apply the method of the double posit. The sequence of the numerous actions 
of a single day—when we turn on the faucet, hoping that the water will run; 
when we call the telephone number of a friend, hoping to obtain a connection; 
and so on—represents a rather long Poisson sequence. The statistical justifi¬ 
cation of the posit, therefore, is applicable to the actions of a single person. 
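
That a sequence of dissimilar decisions still yields a statistical success can be illustrated as follows. The sketch is a hedged illustration with hypothetical probabilities; it merely averages, over decisions whose probabilities differ from element to element, the success ratio of positing the more probable outcome each time.

```python
from fractions import Fraction


def expected_success_ratio(probabilities):
    """Average success ratio when the more probable outcome is posited each time.

    Each entry is the probability of some event; the events need not be similar.
    """
    best = [max(p, 1 - p) for p in probabilities]
    return sum(best, Fraction(0)) / len(best)


# Hypothetical probabilities for three unrelated events of a single day.
day = [Fraction(99, 100), Fraction(1, 50), Fraction(19, 20)]
ratio = expected_success_ratio(day)
```

Because each of the assumed probabilities lies close to 1 or 0, the expected ratio of successful posits is itself close to 1, although no two decisions concern the same kind of event.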

If we are asked to find the probability holding for an individual future 
event, we must first incorporate the case in a suitable reference class. An 
individual thing or event may be incorporated in many reference classes, 
from which different probabilities will result. This ambiguity has been called 
the problem of the reference class. Assume that a case of illness can be char¬ 
acterized by its inclusion in the class of cases of tuberculosis. If additional 
information is obtained from an X-ray, the same case may be incorporated 
in the class of serious cases of tuberculosis. Depending on the classification, 
different probabilities will result for the prospective issue of the illness. 

We then proceed by considering the narrowest class for which reliable statistics 
can be compiled. If we are confronted by two overlapping classes, we shall 
choose their common class. Thus, if a man is 21 years old and has tuberculosis, 
we shall regard the class of persons of 21 who have tuberculosis. Classes that 
are known to be irrelevant for the statistical result may be disregarded. A class 
C is irrelevant with respect to the reference class A and the attribute class B 
if the transition to the common class A . C does not change the probability, 
that is, if P(A.C,B) = P(A,B). For instance, the class of persons having the 
same initials is irrelevant for the life expectation of a person. 
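
The choice of the narrowest class with reliable statistics, and the test of irrelevance, can be sketched as follows. Everything named here is an assumption for illustration: cases are represented as sets of property labels, `min_cases` is a hypothetical threshold standing in for “reliable statistics”, and the sample data are invented. When several classes of equal narrowness qualify, the sketch simply returns the first one it meets, which is an arbitrary choice.

```python
from fractions import Fraction
from itertools import combinations


def frequency(records, reference, attribute):
    """Return (relative frequency of attribute, number of cases) in the reference class."""
    members = [r for r in records if reference <= r]
    if not members:
        return None, 0
    hits = sum(1 for r in members if attribute in r)
    return Fraction(hits, len(members)), len(members)


def narrowest_reliable_class(records, known, attribute, min_cases):
    """Search the conjunctions of the known properties, narrowest first."""
    props = sorted(known)
    for size in range(len(props), -1, -1):
        for combo in combinations(props, size):
            reference = frozenset(combo)
            freq, n = frequency(records, reference, attribute)
            if n >= min_cases:
                return reference, freq, n
    return None


def is_irrelevant(records, reference, extra, attribute):
    """A class C is irrelevant if P(A.C,B) = P(A,B)."""
    narrowed, _ = frequency(records, reference | {extra}, attribute)
    broad, _ = frequency(records, reference, attribute)
    return narrowed == broad


# Invented data: 100 persons aged 21, of whom 10 have tuberculosis.
records = (
    [frozenset({"age21", "tb", "death"})] * 3
    + [frozenset({"age21", "tb"})] * 7
    + [frozenset({"age21", "death"})] * 1
    + [frozenset({"age21"})] * 89
)
known = {"age21", "tb"}
narrow = narrowest_reliable_class(records, known, "death", min_cases=10)
broad = narrowest_reliable_class(records, known, "death", min_cases=50)
```

With a threshold of 10 cases the common class of the two properties is used; with a threshold of 50 the sketch falls back to the wider class, for which the assumed data give a different frequency.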



We do not affirm that this method is perfectly unambiguous. Sometimes it 
may be questioned whether a transition to a narrower class is advisable, 
because, perhaps, the statistical knowledge regarding the class is incomplete. 
We are dealing here with a method of technical statistics; the decision for a 
certain reference class will depend on balancing the importance of the predic¬ 
tion against the reliability available. It is no objection to this interpretation 
that it makes the probability constructed for the single case dependent on 
the state of our knowledge. This knowledge may even be of such a kind that 
it does not determine one class as the best. For instance, we may have reliable 
statistics concerning a reference class A , and likewise reliable statistics for a 
reference class C, whereas we have insufficient statistics for the reference 
class A . C. The calculus of probability cannot help in such a case because 
the probabilities P(A,B) and P(C,B) do not determine the probability 
P(A . C,B). The logician can only indicate a method by which our knowledge 
may be improved. This is achieved by the rule: look for a larger number of 
cases in the narrowest common class at your disposal. 
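
That P(A,B) and P(C,B) do not determine P(A.C,B) can be shown by a small counterexample. The two tables of counts below are invented for illustration, and the helper `prob` is a hypothetical name; each key lists the classes to which a group of cases belongs, and each value is the number of such cases.

```python
from fractions import Fraction


def prob(counts, reference, attribute="B"):
    """P(reference, attribute): frequency of the attribute within the reference class."""
    in_ref = sum(n for props, n in counts.items() if reference <= props)
    hits = sum(n for props, n in counts.items()
               if reference <= props and attribute in props)
    return Fraction(hits, in_ref)


A, C, AC = frozenset("A"), frozenset("C"), frozenset("AC")

# First invented world: every case in the common class A.C has the attribute B.
world_1 = {frozenset("ACB"): 10, frozenset("A"): 10, frozenset("C"): 10}
# Second invented world: no case in the common class A.C has the attribute B.
world_2 = {frozenset("AC"): 10, frozenset("AB"): 10, frozenset("CB"): 10}

summary = {
    name: (prob(w, A), prob(w, C), prob(w, AC))
    for name, w in {"world_1": world_1, "world_2": world_2}.items()
}
```

In both assumed worlds P(A,B) and P(C,B) equal 1/2, yet P(A.C,B) is 1 in the first and 0 in the second, so only statistics for the common class itself can settle its value.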

Whereas the probability of a single case is thus made dependent on our 
state of knowledge, this consequence does not hold for a probability referred 
to classes. If the reference class is stated, the probability of an attribute is 
objectively determined, though we may be mistaken in the numerical value 
we assume for it on the basis of inductions. The probability of death for men 
21 years old concerns a frequency that holds for events of nature and has 
nothing to do with our knowledge about them, nor is it changed by the fact 
that the death probability is higher in the narrower class of tuberculous 
men of the same age. The dependence of a single-case probability on our state 
of knowledge originates from the impossibility of giving this concept an independent 
interpretation; there exist only substitutes for it, given by class 
probabilities, and the choice of the substitute depends on our state of knowl¬ 
edge. My thesis that there exists only one concept of probability, which applies 
both to classes and to single cases, must therefore be given the more precise 
formulation: there exists only one legitimate concept of probability, which 
refers to classes, and the pseudoconcept of a probability of a single case 
must be replaced by a substitute constructed in terms of class probabilities. 

The substitute is constructed by regarding the individual case as the limit 
of classes becoming gradually narrower and narrower. The method is justified 
in the theory of probability by the fact that, as explained above, we obtain 
a greater number of successes if we employ the probability of the subsequence 
P(A.S,B) and not the probability P(A,B) as the basis of our posits. A repeated 
division of the main sequence into subsequences will lead to progressively 
better results as long as the probability is increased at each step. 
According to general experience, the probability will approach a limit when 
the single case is enclosed in narrower and narrower classes, to the effect 
that, from a certain step on, further narrowing will no longer result in noticeable 
improvement. It is not necessary for the justification of this method 
that the limit of the probability, respectively, is = 1 or = 0, as the hypothesis 
of causality assumes. Neither is this necessary a priori; modern quantum 
mechanics asserts the contrary. 2 It is obvious that for the limit 1 or 0 the 
probability still refers to a class, not to an individual event, and that the 
probability 1 cannot exclude the possibility that in the particular case con¬ 
sidered the prediction is false. Even in the limit the substitute for the prob¬ 
ability of a single case will thus be a class probability, and we shall always 
depend on the method of positing. 

Besides choosing a suitable reference class, we must also choose a sequence 
into which the individual case considered is to be incorporated. This choice 
is usually less difficult than that of the reference class because the frequency 
will be the same for most sequences that can be reasonably chosen. We often 
follow the time order of the events observed or of the observations made. 
One of the rules to be required is that a knowledge of the attribute of the 
individual case should not be used for the construction of the order of the 
sequence. (See the remarks on random sequences in § 30.) 

There are, however, instances in which the choice of the sequence is connected 
with ambiguities, as in lattice arrangements where the horizontal and 
the vertical sequences converge to different limits. If a particular element 
$y_{ki}$ of the lattice is considered, the probability assumed for it depends on 
whether further observations concern the horizontal or the vertical sequence 
to which it belongs. An illustration is offered by the rule of succession (22, § 62); 
the probability value $\frac{n+1}{n+2}$ refers to a column and is applicable only 
if a sequence in the vertical direction, supplied by a set of horizontal initial 
sections of a certain kind, represents the experiences to be envisaged. If, on 
the contrary, the horizontal sequence to which $y_{ki}$ belongs is continued, the 
probability value characterized by the maximum inverse probability $v_n$ is 
$p = 1$, a value which is then preferable to the value $\frac{n+1}{n+2}$. This result 
means that if many horizontal sequences of this kind are considered, most of 
them will have a limit of the frequency close to 1, which is thus the most appropriate 
value to be transferred to the element $y_{ki}$. The illustration makes it 
obvious that the probability assumed for the single case depends on the mode 
of procedure by which the single case is incorporated in a sequence. 

The solution offered here for the probability of the single case is essentially 
different from the interpretations discussed in § 71. I regard the statement 
about the probability of the single case, not as having a meaning of its own, 


2 See H. Reichenbach, “Das Kausalprinzip in der Physik,” in Naturwissenschaften , Vol. 19 
(1931), p. 716; and Philosophic Foundations of Quantum Mechanics (Berkeley, 1944), § 1. 
but as representing an elliptic mode of speech. In order to acquire meaning, 
the statement must be translated into a statement about a frequency in a 
sequence of repeated occurrences. The statement concerning the probability 
of the single case thus is given a fictitious meaning, constructed by a transfer 
of meaning from the general to the particular case. The adoption of the fictitious 
meaning is justifiable, not for cognitive reasons, but because it serves the 
purpose of action to deal with such statements as meaningful. 

For a better understanding of the solution, consider the analogous solution 
of a problem of deductive logic. When we speak of a necessary synthetic or 
physical implication in an individual case, we mean that the case is an instance 
of a general law. The statement, “If you press this button, the bell will ring”, 
expresses a physical necessity and thus a reasonable implication. And yet 
by this classification of the statement we mean only that the same adjunctive 
implication holds in all similar cases. Physical necessity is interpretable in 
terms of “always”. It was explained in the discussion of the general implication 
(3, § 6) that this interpretation involves a transfer of meaning from 
the general to the particular case. A similar transfer is characteristic for prob¬ 
ability statements. The statement, “If you press this button, the bomb will 
hit the target with the probability derives its meaning from a reference 
to generality, just as does the statement about the bell. The only difference 
is that the probability statement indicates a frequency relation that does 
not hold for all cases, but is restricted to a certain fraction of cases. 

This interpretation of probability statements is complicated by the following 
peculiarity. If we have an implication $(A \supset B)$, we can add an arbitrary 
term in the implicans, that is, we can derive the implication $(A.C \supset B)$. 
If we have a probability implication $(A \underset{p}{\ni} B)$, however, the addition of a term 
in the implicans will, in general, lead to a different degree of probability, that 
is, to a probability implication $(A.C \underset{q}{\ni} B)$ where q is different from p. This 
is why the choice of the reference class is easily made for a general implication, 
whereas it is difficult to make it for a probability implication. Once a class A 
is found such that $(A \supset B)$ holds, we know that if $x_i \in A$ we shall have $y_i \in B$; 
it does not matter to what other classes the event $x_i$ belongs. For a probability 
implication there is no such simple relation. We must be aware of the possibility 
that, if $x_i$ belongs to both A and C, the reference to the common class A.C 
may lead to a value of the probability different from the one resulting for the 
reference class A. 

Therefore, we can ask only for the best reference class available, the reference
class that, on the basis of our present knowledge, will lead to the greatest
number of successful predictions, whether they concern hits of bombs, cases
of disease, or political events. If no statistics are available for the common
class A.C, we shall base our probability calculations on the reference class A,
and must renounce the improvement in the success ratio that might result
from the use of the reference class A.C in combination with the method of
the double posit. Such a procedure seems reasonable if we realize that narrowing
the reference class means nothing but increasing the success ratio,
and that there is no reference class that permits the prediction of a single case. 
This goal, which could be reached if we had a knowledge of synthetic logical 
implications combining past and future events, is unattainable if probability 
implications are all that we have to connect the past with the future. 

We must renounce all remnants of absolutism in order to understand the 
significance of the frequency interpretation of a probability statement about 
a single case. But there is no place for absolutism in the theory of probability
statements concerning physical reality. Such statements are used as rules of
behavior, as rules that determine the most successful behavior attainable in 
a given state of knowledge. Whoever wants to find more in these statements 
will eventually discover that he has been pursuing a chimera. 


§ 73. The Logical Interpretation of Probability 

In order to construct a logical form for probability statements concerning 
single cases, it is advisable to introduce a change in the logical classification 
of probabilities. In the frequency interpretation, a probability is regarded 
as a property of a sequence of events. Correspondingly, the statement about 
the probability of a single case is regarded as stating a property, though
fictitious, of an individual event. It is possible, however, to go from events
to sentences about events, and to regard a probability, not as a property of
the event, but as a property of the sentence about the event. Instead of saying,
for instance, that the probability of obtaining face 6 with a die is = 1/6, we
can say that the probability of the sentence, “Face 6 will turn up”, is = 1/6.
By this transition, probability is made a rating of propositions; and probability
statements belong not in the object language but in the metalanguage.

The dual possibility of conceiving probability was first seen by George 
Boole, 1 who wrote, “There is another form under which all questions in the 
theory of probabilities may be viewed; and this form consists in substituting 
for events the propositions which assert that those events have occurred, or 
will occur”. I shall introduce the term logical interpretation of probability for
this conception; the conception previously used will be called object interpretation.
The logical interpretation offers the advantage that the probability
attached to the single case assumes the form of a truth value of a proposition
or, rather, since the proposition can be maintained only in the sense of a posit,
of the truth value of a posit. We shall use the term weight for the truth value
of the posit; the probability of the single case, therefore, is regarded, in the
logical interpretation, as the weight of a posit. A posit the weight of which is
known is called an appraised posit.

1 The Laws of Thought (London, 1854), p. 247.

When we say that probability assumes the form of a truth value, we use
the latter term in a wider sense than usual. Classical logic is two-valued; it 
knows only the two truth values true and false. In regarding probability as 
a truth value we construct a multivalued logic, differing from other such 
logics in that it is a logic with a continuous scale of truth values ranging 
from 0 to 1. The formal construction of this probability logic will be carried 
through in chapter 10. 

An analysis of language reveals that many of its elements can be understood
only from the viewpoint of probability logic. We often use sentences
referring to individual events that are not asserted to be certainly true; and 
we indicate our truth evaluation by words like “probably”, “likely”, or 
“presumably”. The Turkish language possesses a particular mood of the 
verb expressing that a statement about a past event is not maintained as 
certain, but only probable. 2 Such forms of language are expressions of the 
predicate “weight”. 

That the weight referred to in such sentences is reducible to a frequency 
meaning is demonstrable, not only for sentences for which the reference 
class is obvious—as, for example, “It will probably rain tomorrow”—but also 
for statements that do not easily lend themselves to statistical interpretation. 
For instance, the statement that Julius Caesar was in Britain must be regarded 
as a posit having a certain weight that is translatable into a frequency statement.
When we look for the methods by which the weight is ascertained, we
usually discover the statistical origin of the weight. Thus, in order to ascertain
the reliability of the statement about Caesar’s stay in Britain, we investigate
the number of chroniclers who report such a fact; and we measure the reliability
of the individual chronicler by the number of his reports that are
corroborated by reports of other writers. 3

True, we often prefer an intuitive appraisal of the weight to a statistical 
enumeration. For instance, we judge intuitively from his presentation whether 
a writer is a reliable authority. But though an intuitive appraisal may sometimes
make a rationalized inference unnecessary, it does not invalidate the
inference. Thus the inferences of the meteorologist in regard to the weather 
of the next day are not made false by the fact that the intuitive appraisal 
of a sailor may arrive at the same result by simpler methods. We prefer an 
intuitive appraisal to a statistical determination only if statistics are incomplete.
The human mind is fortunate in being endowed with the ability of
intuitive appraisal; in many cases the use of this talent leads to a better
determination of probability values than the compilation of incomplete
statistics.

2 See ESL, p. 338.

3 In a more precise analysis the inference must be interpreted through explanatory
induction (see § 84); my presentation applies a simplification.

The analysis of mental processes during an intuitive action must be left 
to the psychologist—the logician can ask only for the rational reconstruction 
of an action. Such a reconstruction is given in the statistical interpretation. 
For instance, we must regard the scientific inference of the meteorologist
as a rational reconstruction of what the sailor does intuitively. In the same
sense, the statistical interpretation of the weight of sentences about individual
events, like Caesar’s stay in Britain, must be regarded as the rational reconstruction
of an intuitive estimate of the weight. That, on the other hand,
statistics are necessary not only in ascertaining the weight but also in establishing
the meaning of the probability statement is apparent from the fact
that we use statistics for further verification of the statement, in the form of 
a verification of the statistical predictions included in the statement. 

It has been argued 4 that probability statements of the kind considered 
are not quantifiable, that they are intended to state only a relation of order 
expressible by the terms “more probable” or “less probable”. It is true that 
we often restrict ourselves to the statement of order relations. The verification,
obviously, will be easier than that of quantitative relations because the
statement of order states less than that of quantity. It would be a serious 
mistake, however, to believe that the employment of relations of order is a 
proof that quantitative relations cannot be established. When relations of 
order are asserted, we are often able to supplement them, at least, by rough 
estimates of numerical probabilities. This ability is demonstrated in the habit 
of betting. We bet on the outcome of a boxing match, a scientific experiment, 
a political election, or a war, expressing the numerical value of the appraised 
degree of probability by the height of our stakes. The rational reconstruction 
of such bets would lead to statistical evaluations. 
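
The way in which stakes express a numerical probability can be sketched as follows. The sketch assumes, as a simplification not stated in the text, that a bettor regards a bet as fair when his expected gain is zero; the function name `implied_probability` is hypothetical.

```python
from fractions import Fraction


def implied_probability(own_stake, opposing_stake):
    """Probability of the event at which staking `own_stake` on it, against
    `opposing_stake` staked on its nonoccurrence, has zero expected gain.

    Solves p * opposing_stake - (1 - p) * own_stake = 0 for p.
    """
    if own_stake <= 0 or opposing_stake <= 0:
        raise ValueError("stakes must be positive")
    return Fraction(own_stake, own_stake + opposing_stake)


# Someone who stakes 3 units against 1 appraises the event at 3/4.
appraisal = implied_probability(3, 1)
print(appraisal)
```

Under this assumption, equal stakes correspond to the appraisal 1/2, and higher stakes on one's own side correspond to a higher appraised probability.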

The logical interpretation, which was defined for the probability of a single 
case, can be extended to the case where a sequence is used as the object of 
interpretation; it then leads to the consideration of sequences of propositions, 
or propositional sequences. Both interpretations thus have correlates in logical
interpretations. The logical interpretation of sequence probabilities, however, 
has only a theoretical interest, and will be discussed later from that point 
of view. The practical importance of the logical interpretation springs from 
its application to the single case, because, in this application, probability 
assumes the function of a substitute for a truth value. 

Because of its analogy with the concept of a statement of known truth 
value, the concept of appraised posit is indispensable for the understanding 
of language. It defines the logical category under which probability statements 
concerning individual cases are to be subsumed, and allots to such statements
a legitimate place within the body of knowledge. A two-valued logic has no
place for unknown truth values; so long as the truth value of a statement
is not known, as for statements about future events, classical logic does not
allow us to judge the truth of the statement. All it offers is a “wait and see”.
Our actual behavior, however, does not follow this maxim of passivity. We
form judgments about the likelihood of the event and use them as a guide
for action. We must do so, because action presupposes judgments about future
events; if we had to wait until direct observation informed us about the
occurrence or nonoccurrence of an event, it would be too late to plan actions
influencing the event.

4 By J. M. Keynes, A Treatise on Probability (London, 1921), p. 84.

Probability logic supplies the logical form of a truth evaluation by degrees 
that is applicable before the occurrence of an event. It allows us to coordinate
to the sentence about the individual event a fictitious truth value, derived 
from the frequency within an appropriate sequence, in such a way that, so 
far as actions are concerned, the fictitious truth value, or weight, satisfies 
to a certain extent the requirements that can be asked with respect to a truth 
value. The logical interpretation repeats the procedure followed in the object- 
language interpretation of the probability statement about the single case: 
the metalinguistic conception of the probability of the single case as a weight 
of a statement, too, is constructed by a transfer of meaning from the general 
to the particular case. The numerical value of the frequency in the sequence 
is transferred to the individual statement in the sense of a rating, although 
the individual statement taken alone exhibits no features that could be 
measured by the rating. 

In spite of the fictitious nature of the rating so constructed, the system 
of posits endowed with weights can be substituted for a system of statements 
known as true or false. The essential difference between the two systems 
consists in the fact that in the substitute system our action is determined, 
not by a knowledge of the truth value of the statement about the individual 
event, but by a knowledge of a truth frequency in a sequence. The substitution 
of this statistical knowledge for unavailable specific knowledge is justified 
because it offers success in the greatest number of cases. This is why we can 
act when the truth of the sentence about the individual event remains unknown:
the frequency interpretation of probability replaces the unattainable
ascertainment of truth by a procedure that accords the best success attainable 
in the totality of cases. 
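
The claim that acting on the weight yields success in the greatest number of cases can be checked by a short computation. The following sketch rests on assumptions added here for illustration: the weight 7/10 is invented, posits are made independently of the individual outcome, and the function name `success_ratio` is hypothetical.

```python
from fractions import Fraction


def success_ratio(weight, posit_rate):
    """Expected fraction of correct posits when the event has truth frequency
    `weight` and its occurrence is posited in a fraction `posit_rate` of the
    cases, chosen independently of the outcome."""
    return weight * posit_rate + (1 - weight) * (1 - posit_rate)


w = Fraction(7, 10)                     # assumed weight of the posit
always = success_ratio(w, Fraction(1))  # always posit the more probable event
never = success_ratio(w, Fraction(0))   # always posit its negation
matching = success_ratio(w, w)          # posit the event in only 7/10 of cases
print(always, never, matching)
```

Under these assumptions, always positing the more probable event succeeds in 7/10 of the cases, whereas the other two rules succeed in 3/10 and 29/50 of the cases, respectively.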

It has been objected that the frequency interpretation of the probability 
of the single case does not correspond to what a person actually means when 
he regards an individual event as probable. My answer may be found in the 
discussion of meanings at the beginning of § 72: the objection seems irrelevant 
because logical analysis is not concerned with the description of subjective 
images and intentions associated with words. The use of the word “probable” 
with reference to individual events, or as denoting a truth value of sentences
about individual events, can be given a meaning in terms of frequencies; 
if this meaning is assumed, our words will be made compatible with our actions. 
This result seems to be established beyond doubt. Since there is no other 
way of determining meanings than by defining interpretations that make 
language correspond to behavior, I do not see on what grounds the universal 
applicability of the frequency interpretation of probability can be questioned. 

§ 74. Probability Meaning 

The interpretation of probability as a truth value permits the introduction 
of a new category of meaning. The verifiability theory of meaning, in its strict 
form, makes meaning dependent on verifiability as true or false. The theory 
can be extended, however, so that a sentence is regarded as meaningful when 
it is possible to determine a weight for the sentence. 1 By “possible” we understand
here “physically possible”, that is, “compatible with physical laws”.
The meaning so defined is called probability meaning.

The advantage of the new category of meaning derives from the fact that 
a determination of weight may be physically possible, whereas a corresponding 
absolute verification is not physically, but only logically, possible. With 
respect to simple sentences, for example, sentences concerning the weather 
of the next day, the distinction is irrelevant; here it is physically possible 
both to determine a weight in advance and to verify after the occurrence 
of an event. But this cannot be done with more complicated sentences. Thus, 
a statement about the temperature of the sun cannot be strictly verified in 
the sense of a physical possibility of verification, but it is physically possible 
to determine a weight for it. Probability meaning, therefore, represents a wider
category than physical truth meaning, that is, a meaning defined by the
physical possibility of strict verification. But it is a narrower category than
logical meaning, a meaning defined by the logical possibility of strict verification
(see § 66). When the term “verification” is used in a wider sense, to
include the determination of a weight, probability meaning represents a 
physical meaning, since it is based on the physical possibility of verification. 
It can be shown that probability meaning constitutes the very category of 
meaning that underlies conversational and scientific language, for which 
physical truth meaning is too narrow and logical meaning too wide. 

The meaning of limit statements, analyzed in § 66, may now be reconsidered 
in the light of the category of probability meaning. It is only logically, not 
physically, meaningful to speak about the limit of the frequency of an infinite 
sequence that is extensionally given; therefore a finitization of limit statements 
is required for applications to physical reality. This classification is based
on the postulate of absolute verification. If verification in the wider sense is
admitted, it is sometimes physically possible to verify the limit statement
for an infinite extensional sequence, namely, when it is physically possible
to determine a weight for it.

1 See EP, § 7.

These conditions are realized in a probability lattice. Here the individual 
horizontal sequence may be regarded as a single case the weight of which is 
determined by a frequency of sequences counted in the vertical direction. 
In an arrangement of this kind the limit statement concerning the infinite 
extensional sequence has probability meaning. 

These considerations correspond to actual situations. We make statements 
about the probability of a limit of the frequency at a certain value p, and we 
can also compute the probability that for a given degree of convergence
ε the n-th element of the sequence is a place of convergence. Such results
concerning probabilities of a higher level can be derived within the frame 
of Bernoulli’s theorem. If we admit probability meaning, therefore, we can 
dispense with the requirement of finitization. 
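
A probability of the higher level of this kind can be computed explicitly in the simplest case. The sketch below assumes independent trials with a known probability p (the situation of Bernoulli's theorem) and uses the binomial distribution; the function name `prob_within` and the numerical values are illustrative assumptions, not taken from the text.

```python
from fractions import Fraction
from math import comb


def prob_within(n, p, eps):
    """Probability that the relative frequency k/n of n independent trials,
    each with probability p, deviates from p by at most eps."""
    total = Fraction(0)
    for k in range(n + 1):
        if abs(Fraction(k, n) - p) <= eps:
            total += comb(n, k) * p**k * (1 - p) ** (n - k)
    return total


p = Fraction(1, 2)
eps = Fraction(1, 10)
short_run = prob_within(10, p, eps)
long_run = prob_within(100, p, eps)
print(short_run, float(long_run))
```

For p = 1/2 and ε = 1/10 the probability is 21/32 at n = 10 and is larger at n = 100; that it approaches 1 with increasing n is the content of Bernoulli's theorem.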

This analysis, however, has a restricted value. A computation of probabilities
of the higher level is possible only when limits of the higher level are
known. Whereas statements about limits of the first level are thus given a 
probability meaning, those concerning limits of the second level do not have 
this sort of meaning. True, a probability meaning for statements of the second 
level can be constructed by computing probabilities of the third level, but 
then new statements are introduced that do not have a probability meaning. 
In other words, a probability meaning can be constructed for every limit 
statement, but not for all.

The use of probability meaning for the discussion of limit statements is
therefore restricted to an advanced state of knowledge, a state in which a sufficient
number of probabilities is known (see § 70). However, within a primitive
state of knowledge (a state preceding the determination of probability values
and therefore the kind of state on which the determination of the first probability
values must be based) probability meaning cannot be used. Since no
weight is known for limit statements of the highest level, they cannot be
incorporated in the frame of probability meaning and must be subject to the
method of finitization explained above (§ 60). The general theory of induction,
in particular, must be given with respect to primitive knowledge and, therefore,
without the use of probability considerations.




Chapter 10


PROBABILITY LOGIC


§ 75. The Problem of a Multivalued Logic 

We turn now to the technical construction of the probability logic envisaged 
in § 73, in which the alternative true-false is replaced by a continuous scale 
of truth values. In the history of logic the question has been asked repeatedly 
whether the dichotomy into true and false is an ultimate necessity of our 
thinking, or whether the human mind has the capacity to dispense with it. 
The very beginnings of two-valued logic in Greek philosophy were accompanied
by criticism of this logic, which arose out of a consideration of the
uncertainty of the future. In recent times, repeated attempts have been 
made to construct a generalized logic. 1 

These investigations, however, did not originate in the problem of probability,
but rather in the question of the modalities: the three categories of
necessity, possibility, and impossibility. Consequently, interest centered
around a three-valued logic corresponding to these categories. Furthermore,
a five-valued logic was conceived, including truth and falsehood in addition
to the three modalities mentioned above, thus uniting five concepts in one
logic. All the investigations were focused on the problem of extending the
truth tables to a system of more than two truth values. Yet, since there are
many methods for such extension, the systems could not be constructed without
arbitrariness. Some of the constructions are merely formal; so the question
of their interpretation is left open. But even in attempts that preserved the
connection with the meaning of the modalities, no satisfactory correspondence
between the constructed system and the actual usage of language was attained.

1 As J. Lukasiewicz has stressed, the works of Aristotle contain passages suggesting that
he thinks of a generalization of alternative logic into a three-valued logic. Whereas the
Stoics, as strict determinists, emphatically maintained a two-valued logic, the Epicureans
assimilated Aristotle’s idea. Among the more recent investigations, the intuitionism of
Brouwer, rejecting the tertium non datur, belongs to this line of development, which leads
away from alternative logic: L. E. J. Brouwer, Intuitionisme en formalisme (Groningen, 1912),
and in Bull. Amer. Math. Soc., Vol. 20 (1913). For further reference see the index of
Erkenntnis, Vol. II (1931), p. 151.

Other multivalued logics have been treated in the form of a calculus: Hugh MacColl,
Symbolic Logic and Its Applications (London, 1906); E. L. Post, in Amer. Jour. Math.,
Vol. 43 (1921), p. 182; J. Lukasiewicz, in Ruch Filozoficzny (Lwow), Vol. V (1920), pp.
169-170; and in Comptes Rendus Soc. des Sciences et des Lettres Varsovie, Vol. 23 (1930), cl. 3,
p. 51; J. Lukasiewicz and A. Tarski, ibid., p. 1. Walter Dubislav, in Jour. f. d. reine u.
angew. Math., Vol. 161 (1929), p. 107, used multivalued truth tables for a proof of the
consistency of the calculus of functions. O. Becker, in Jahrb. f. Philos. u. phänomen. Forschung,
Vol. XI (1930), p. 497, treated the problem of modalities within the framework of a
multivalued logic. These investigations are connected with the concept of strict implication
developed by Clarence Irving Lewis. A presentation is found in C. I. Lewis and C. H.
Langford, Symbolic Logic (New York and London, 1932), Appendix II. Reports on the
problem of multivalued logic are given in: C. I. Lewis, A Survey of Symbolic Logic (Berkeley,
1918); J. Jorgensen, in Erkenntnis, Vol. III (1932), p. 73; S. Zawirski, in Revue de Métaph. et
de Morale, Vol. 39 (1932), p. 503. The first application of a three-valued logic to physics was
given by H. Reichenbach in Philosophic Foundations of Quantum Mechanics (Berkeley,
1944), §§ 32-33, which also contains a presentation of this logic.

Turning to probability, we find the first logical interpretation of the theory of probability
in George Boole, The Laws of Thought (London, 1854), chaps. xvi-xxi. J. Lukasiewicz, in
Comptes Rendus Soc. des Sciences et des Lettres Varsovie, Vol. 23 (1930), cl. 3, p. 72, mentions
an attempt to construct a probability logic by means of truth tables, but his tables of a
continuous logic are not compatible with the calculus of probability. A theory of causality,
for which future events are objectively undetermined, was published by H. Reichenbach in
Ber. d. bayer. Akad., math.-phys. Kl., 1925, p. 133. At the congress for epistemology of the
exact sciences at Prague, in 1929, he read a paper presenting the program of using a logic
with a continuous scale of truth values for the treatment of the theory of probability: see
Erkenntnis, Vol. I (1930), p. 158. The author’s first publication of probability logic was
made in Sitzungsber. d. preuss. Akad., phys.-math. Kl., 1932, p. 476; that of the theory of
induction, including the concept of posit, in Erkenntnis, Vol. III (1932), pp. 401-425.

The present method of investigation has the advantage of connecting the 
problem of a multivalued logic with the analysis of the calculus of probability. 
It will be seen that the logicized form given to this calculus greatly facilitates
the attempt to construct a logic in which the concept of truth is replaced by 
the concept of probability. We shall find the laws of this probability logic by 
transcribing the laws of the calculus of probability into a multivalued logic. 
This method is free from the arbitrariness that impeded the progress of other 
investigations, and leads to a logic of a continuous scale of truth values, a 
quantitative logic. 

Because of this immediate access to probability logic, we need not construct
it by way of a detour, going into a discrete n-valued logic first and
then extending the latter to the case of a continuous scale of degrees of truth.
We shall be able to proceed inversely, deriving discrete logics from probability
logic by dividing the continuous scale in a suitable fashion. A three-valued
logic of modalities will thus be constructed.

The generalization of the concept of truth to be given concerns the truth 
of empirical statements but not mathematical truth. Thus the objective will 
not be to construct an interpretation in which such statements as the theorem 
of Pythagoras are regarded as “merely probable”. Such a development is 
ruled out because mathematics is not concerned with the concept of truth 
but with the concept of tautology (§ 4). A tautology is a formula that is true
whatever be the truth values of the elementary propositions of which it is composed.
The term “truth”, which occurs twice in the sentence, is the concept
used in empirical science. It is a matter of experience to discover whether the
elementary components of a compound proposition are true. Only for a
tautology will this be irrelevant, since the truth of the tautology is independent
of that of its elementary propositions. The investigations of mathematics,
therefore, do not concern the truth of the elementary components,
but the relation between this truth and the truth of the whole formula. It is
the aim of mathematics to construct formulas the truth of which is independent
of the truth of their components. The theorems of geometry, for
instance, must be judged from this point of view. It is not asserted that the
theorem of Pythagoras is true, but that it follows from the axioms, that is,
the implication from the axioms to the theorem is a tautology. Mathematics 
is thus not concerned with truth, but with a certain relation between truth 
values. Later it will be shown that the concept of tautology can be carried 
through also in multivalued logic. 
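
The definition of a tautology in two-valued logic can be illustrated by an exhaustive check over all truth values of the elementary propositions. The following Python sketch is an illustration added here; the function names and the two sample formulas are assumptions, not taken from the text.

```python
from itertools import product


def implies(a, b):
    """Material implication of two-valued logic."""
    return (not a) or b


def is_tautology(formula, n_vars):
    """True if `formula` is true whatever be the truth values of its
    `n_vars` elementary propositions."""
    return all(formula(*values) for values in product([True, False], repeat=n_vars))


def modus_ponens(a, b):
    # The compound formula: (a and (a implies b)) implies b
    return implies(a and implies(a, b), b)


def plain_implication(a, b):
    # The formula "a implies b", whose truth depends on its components
    return implies(a, b)


print(is_tautology(modus_ponens, 2), is_tautology(plain_implication, 2))
```

The first formula is true for all four combinations of truth values; the second is false when a is true and b is false, so its truth remains a matter of the truth of its components.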

A notational remark is in order. Truth is a property of sentences, not of 
physical objects. Sentences that state the truth of sentences of the object 
language, therefore, belong in the metalanguage. Thus we should write 
“V(‘a’) = 1” for the sentence, “The sentence ‘a’ is true”. For the sake of
simplicity, however, the quotation marks will be omitted, and the parentheses,
used in combination with the symbol V or similar symbols, will be
regarded as a sign that includes the function of the quotation marks. We
therefore shall write “V(a) = 1” for the sentence above. Similarly, we shall
write, “the sentence a”, instead of writing, “the sentence ‘a’ ”, allowing the
word “sentence” in combination with the succeeding italics to assume the 
function of quotation marks. 


§ 76. A Quantitative Logic of an Individual Verifiability 

Before turning to the construction of probability logic, I wish to show, in a 
preliminary investigation, that a quantitative concept of truth can be coordinated
to individual statements in such a way that the degree of truth can be
determined for each statement directly, without reference to a frequency 
within a sequence. A quantitative logic of this kind may be said to be of an 
individual verifiability. It does not lead to a logic of probability, however, 
since the latter is based on frequency notions and is therefore of a nonindividual
verifiability; but the logic so constructed may be used to illustrate
certain fundamental features of a quantitative logic. 

In two-valued logic the two truth values are so defined that certain facts 
verify the statement, whereas others falsify it. This means that the statement 
divides all facts to which it is related into two classes: those that make the
statement true, and those that make it false. For example, the statement, 
“The weather is good”, will be called true if the sun is shining and no wind 
is blowing, but also if the sun is occasionally covered by clouds or a soft 
wind is blowing. Saying that the statement is false likewise includes several 
other possible facts, for example, the possibility of a strong wind without 
rain, that of rain without wind, and so on. 

It is possible to go from a dichotomy to a quantitative verification by 
ordering the facts with respect to the degree to which they satisfy the statement.
The example mentioned admits of such an interpretation. The weather
can be more or less perfect, that is, it can have all gradations of intermediary 
forms ending with extremely bad weather. We can therefore ascribe to the
statement a degree of truth depending on the observed facts. Such gradations
of statements occur in everyday language. We frequently say, “This is true 
to a certain degree”; “This is half true, half false”, and so on. 

The method may be analyzed by means of a simpler example. A marksman 
says, “I shall hit the center” (statement a). After the shot we measure the 
distance r of the hit from the center; r is a measure of the degree of truth of 
the statement a (fig. 26). 

In order to obtain the values 0 and 1 for the limiting cases, we may take the
expression

$$v(a) = \frac{1}{1 + r} \tag{1}$$

as a truth value. Let $r_1$ be the value of the radius of the hit. We then have

$$v(a) = \frac{1}{1 + r_1} \tag{2}$$

The value (2) is the truth value of a in the continuous logic.

Fig. 26. Hit on a target as example for a logic with continuous scale of truth,
according to (2) and (3).
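
The truth value defined by expression (1) can be evaluated directly. The following Python sketch is a minimal illustration; the function name `degree_of_truth` and the sample radii are assumptions introduced here.

```python
def degree_of_truth(r):
    """Truth value v(a) = 1 / (1 + r) of the statement "I shall hit the center",
    where r >= 0 is the measured distance of the hit from the center."""
    if r < 0:
        raise ValueError("a distance cannot be negative")
    return 1.0 / (1.0 + r)


# A hit in the center has truth value 1; the value decreases toward 0
# as the distance grows.
values = [degree_of_truth(r) for r in (0.0, 1.0, 3.0, 999.0)]
print(values)
```

The limiting cases mentioned in the text appear at r = 0, where the value is 1, and for very large r, where the value approaches 0.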

Obviously it is not necessary, however, to use a quantitative logic in this
example. The same fact can be incorporated in two-valued logic in either of
two ways:

1. The values of the radius r may be divided into two classes by a demarcation
value $r_0$; and, accordingly, the values v are divided by the demarcation
value $v_0$, corresponding to $r_0$. We now put

$$V(a) = \begin{cases} 1 & \text{for } 1 \ge v \ge v_0, \text{ or, for } r \le r_0 \\ 0 & \text{for } v_0 > v \ge 0, \text{ or, for } r > r_0 \end{cases} \tag{3}$$

In this manner the continuous scale of values v has been transformed into the
alternative V by a dichotomy. The method has the disadvantage, however,
that the truth characterization is diminished in content. If we know only
that V(a) = 1, we do not know at which point r within $r_0$ the hit is situated,
but the expression (2) informs us about this value.

2. We go from the consideration of the degree of truth of the statement a
to the consideration of the truth value of the metalinguistic statement

$$v(a) = \frac{1}{1 + r_1} \qquad \text{abbr. } \acute{a}^{(1)} \tag{4}$$


The addition on the right side means that we abbreviate the statement by
the symbol $\acute{a}^{(1)}$. The superscript in the symbol indicates that we deal here
with a statement of the first metalanguage, whereas the statement a belongs
to the object language. The accent in the symbol $\acute{a}^{(1)}$ indicates that the
statement is true. We have, using the second metalanguage,

$$V(\acute{a}^{(1)}) = 1 \tag{5}$$

This method preserves the full content of the statement. It amounts to the
same as the use of the object-language statement, “The shot hit at the distance
r”. It is clear, however, that the statement can be replaced by the
metalinguistic formulation above.
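
The first of the two ways, the dichotomy by a demarcation value, and the loss of content it entails can be sketched as follows. The function names and the numerical radii are assumptions chosen for illustration only.

```python
def degree_of_truth(r):
    """Continuous truth value v = 1 / (1 + r) for a hit at distance r >= 0."""
    return 1.0 / (1.0 + r)


def two_valued(r, r0):
    """Two-valued truth value V(a): 1 if v >= v0 (that is, r <= r0), else 0."""
    v0 = degree_of_truth(r0)
    return 1 if degree_of_truth(r) >= v0 else 0


r0 = 1.0                      # assumed demarcation radius
near, far, miss = 0.2, 0.9, 1.5

# Both hits inside r0 receive V(a) = 1, although their degrees of truth differ;
# the dichotomy no longer tells us where within r0 the hit is situated.
print(two_valued(near, r0), two_valued(far, r0), two_valued(miss, r0))
print(degree_of_truth(near), degree_of_truth(far))
```

The second way loses nothing, since the true metalinguistic statement still names the continuous value itself.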

The two-valued character thus represents merely a principle of division 
by which statements are classified. It could be replaced by another principle. 
When we ask, “Does a hold?” we commit ourselves to the principle of dichotomy,
expecting the answer “Yes” or “No”. It would, in fact, be possible
to ask, “To which degree does a hold?” and to give the answer in the form,
“a holds to the degree $\frac{1}{1 + r}$”. The two methods constitute different forms of
linguistic classification.

We must not believe, however, that we are bound to a two-valued
metalanguage. In place of the true statement $\acute{a}^{(1)}$ we can introduce another
statement $a^{(1)}$, which, in its turn, is incorporated in a quantitative logic. Thus
we may put

$v(a) = \dfrac{1}{1 + r}$  abbr. $a^{(1)}$    (6)

If $r = r_1$, $a^{(1)}$ is true and is identical with $\acute{a}^{(1)}$. If $r < r_1$ we have

$v(a^{(1)}) = \dfrac{1}{1 + \rho}$, $\rho = r_1 - r$    (7)

The metalanguage has here a continuous truth scale like the object language.
We arrive at a two-valued language only when we go into the second
metalanguage, which contains the true statement

$v(a^{(1)}) = \dfrac{1}{1 + \rho}$  abbr. $\acute{a}^{(2)}$    (8)

The system of the statements $a$, $a^{(1)}$, $\acute{a}^{(2)}$ is now equivalent to the system of
the statements $a$, $\acute{a}^{(1)}$. This is confirmed by a retranslation: $\acute{a}^{(2)}$ indicates
that the value r in $a^{(1)}$ is too small by the amount $\rho$. If we put for r in $a^{(1)}$
the value $r + \rho$, which is $= r_1$, we obtain the true statement $\acute{a}^{(1)}$ instead of
$a^{(1)}$. Using the same procedure, we can also obtain in place of the statement a,
which is not completely true, the true statement

$r = r_1$  abbr. $\acute{a}$    (10)

This statement of the object language is equivalent to the system $a$, $\acute{a}^{(1)}$.

However, the system $a$, $a^{(1)}$ is not equivalent to the true statement $\acute{a}$.
This results from the fact that the last statement of this system is not true.
If we want to cut off a system of multivalued statements after a finite number 
of levels of language, we must have a true statement as the last one. Language 
usually contains the implicit convention that an uttered statement has the 
truth value 1. This principle can be maintained for the last level of finite 
systems of multivalued statements. 

But we are not bound to use finite systems. We can use infinite systems 
that contain no true statement on any level. For each statement, the degree 
of its truth will be stated, but the statement concerning the truth will itself 
be true only to some degree. Using the example, we can construct an infinite 
system of metalanguages of this kind. We put 

$v(a^{(i-1)}) = \dfrac{1}{1 + \rho_i}$, $\rho_i = \dfrac{1}{2^i} \cdot r_1$  abbr. $a^{(i)}$    (11)

According to the principle of retranslation, the true statement $\acute{a}^{(i)}$
coordinated to $a^{(i)}$ is here given by

$v(a^{(i-1)}) = \dfrac{1}{1 + \rho_i + \sum_{m=i+1}^{\infty} \rho_m}$  abbr. $\acute{a}^{(i)}$    (12)

Because of the definition of $\rho_i$ we obtain, for instance,

$v(a^{(1)}) = \dfrac{1}{1 + \frac{r_1}{2}}$  abbr. $\acute{a}^{(2)}$
$v(a) = \dfrac{1}{1 + r_1}$  abbr. $\acute{a}^{(1)}$    (13)

The infinite system defined by (11) is thus equivalent to the system $a$, $\acute{a}^{(1)}$
given by (9) and (4). The system will include (6) and (8) if we put in (6)
$r = \frac{r_1}{2}$, and in (8) $\rho = \frac{r_1}{2}$.

Table 8 clarifies these relations. 

All the systems of statements given in the table are equivalent to each 
other. They describe the same facts. In the three initial columns, only true 
statements appear from a certain level on, but the fourth column does not 
contain any true statements. This system, therefore, cannot be cut off. It is,
however, a convergent system, that is, if the system is cut off at $a^{(i)}$, and $a^{(i)}$
is regarded as true, the mistake will be as small as we wish when i is
sufficiently great.
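
The convergence can be checked numerically. The sketch below assumes that retranslation, when the system (11) is cut off at level i, amounts to adding the corrections $\rho_1, \ldots, \rho_i$ to the value stated in a; the function names and the value `r1 = 3.0` are illustrative.

```python
def rho(i: int, r1: float) -> float:
    """Correction stated on level i, rho_i = r1 / 2**i, as in (11)."""
    return r1 / 2**i


def cut_off_estimate(level: int, r1: float) -> float:
    """Distance obtained when a^(level) is regarded as true.

    Assumption: retranslation adds the corrections rho_1 .. rho_level.
    """
    return sum(rho(m, r1) for m in range(1, level + 1))


r1 = 3.0  # illustrative true distance, not taken from the text
errors = [r1 - cut_off_estimate(i, r1) for i in range(1, 11)]
```

Under these assumptions the mistake made by cutting off at level i equals $r_1 / 2^i$, so it is halved with every further level.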

The convergent character is not a necessary condition for such systems. 
We show this by the slightly changed example 

$r = r_2$  abbr. $a$    (14)

$v(a^{(i-1)}) = \dfrac{1}{1 + \rho_i}$, $\rho_i = (-1)^i \cdot (r_2 - r_1)$  abbr. $a^{(i)}$


This system has the following properties. When we regard $a^{(1)}$ as true, the
true statement corresponding to a will be $r = r_1$; when we regard $a^{(2)}$ as true,
the true statement corresponding to a will be $r = r_2$. The two results will
change alternately when we cut off on higher levels. The system (14),
therefore, says either that $r = r_1$ or that $r = r_2$, and is equivalent to the
statement

$(r = r_1) \lor (r = r_2)$    (15)


which belongs in a two-valued object language. Incidentally, the truth scale 
used in (14) is not enclosed in the limits 0 and 1; this, however, is irrelevant. 
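
The alternation can be reproduced with a short computation. This is a sketch under the assumption that retranslation starts from the value $r_2$ stated in a and adds the corrections $\rho_m = (-1)^m (r_2 - r_1)$ up to the level at which the system is cut off; the numerical values are illustrative.

```python
def cut_off_value(level: int, r1: float, r2: float) -> float:
    """Value of r obtained when a^(level) of system (14) is regarded as true.

    Assumption: retranslation starts from r = r2 and adds the corrections
    rho_m = (-1)**m * (r2 - r1) for m = 1 .. level.
    """
    return r2 + sum((-1) ** m * (r2 - r1) for m in range(1, level + 1))


r1, r2 = 1.0, 4.0  # illustrative distances, not taken from the text
results = [cut_off_value(i, r1, r2) for i in range(1, 7)]
```

Odd levels yield $r_1$ and even levels yield $r_2$, so no level can be singled out as the limit.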

From these considerations we see that a multivalued logic of individual 
verifiability can always be carried through. This holds not only for a logic 
of a continuous scale but also for any v-valued logic. But statements of a
multivalued logic can always be translated into statements of a two-valued
logic by one of the two methods described. Conversely, all statements of
two-valued logic can be translated into statements of multivalued logic if a
suitable rule for the construction of truth values is added, corresponding to the
definition (1).

The two-valued character of classical logic is a convention comparable to
that used in the decimal system for the notation of numbers. The latter is
constructed on the arbitrary basis of the number 10; logic is constructed on
the arbitrary basis of the number 2. There are philosophers, indeed, who
think that an ultimate truth concerning the structure of the universe is
formulated in the statement, “Every proposition must be either true or false”; or
in the equivalent statement, “Every fact must either be or not be”. But
such a statement is no more justified than the statement, “Every number
over 99 must be written with at least three digits”. The latter sentence is
true for the decimal notation, but false if it is asserted for all notations. It
says nothing about numbers, but states merely a property of the notation
used. Similarly, the preceding proposition concerning truth or existence is
true only within two-valued logic, but false if it is asserted without
restrictions. In multivalued logic it does not hold, either in the first or in the second
version given. Here a fact can “exist halfway”; it will do so when the sentence
describing it is half true.

§ 77. Probability as a Property of Propositional Sequences

The logical conception of probability was introduced earlier with reference to 
individual propositions, but with the qualification that the truth value into 
which the probability is thus transformed is of a fictitious nature, since it is 
derived from a frequency. Probability logic, in this conception, is a logic of 
nonindividual verifiability. It is therefore advisable to attach the technical 
construction of probability logic, not to individual propositions, but to 
sequences of propositions. We can then apply the frequency interpretation 
directly and need not deal with fictitious properties. 

Turning to the construction of suitable sequences of propositions, we begin
with the implicational form of writing introduced in § 9:

$(i)(z_i \in A \;\underset{p}{\ni}\; x_i \in B)$    (1)

Using the two propositional functions¹

$hz_i =_{Df} z_i \in A \qquad fx_i =_{Df} x_i \in B$    (2)

we can write, instead of (1),

$(i)(hz_i \;\underset{p}{\ni}\; fx_i)$    (3)

In the P-notation we have for this:

$P(hz_i, fx_i) = p$    (4)

All this is object interpretation. Introducing the logical interpretation, we
should have to write

$P(\text{“}hz_i\text{”}, \text{“}fx_i\text{”}) = p$    (5)


The elements of the sequences are now given by individual propositions of
the form “$hz_i$” and “$fx_i$”. The frequency interpretation is constructed by
counting the number of true propositions “$fx_i$” within the subsequence
selected by true propositions “$hz_i$”.

Since the notation by quotation marks is inconvenient, these marks will
be omitted, and their function will be assumed by parentheses, in
correspondence with the rule for the V-symbol introduced at the end of § 75.
This means that we shall use (4) instead of (5). Without further indication, 
probability expressions belong in the metalanguage from now on. This is 
possible because all rules for metalinguistic expressions correspond strictly 
to those employed in object-language expressions. 

¹ I use here a notation for argument variables that differs from that used above.
Furthermore, I omit the parentheses in propositional functions.

A further linguistic obstacle must be overcome if probability is to be
regarded as a property comparable to truth. Truth is a property of one sentence,
whereas probability is a relation between two linguistic expressions. We can
eliminate this difficulty by assuming the first sequence, indicated by the term
$hz_i$ in (4), as compact. It is then identical with the sequence of the subscripts
and can be omitted. We thus write, adopting the notation of (1, § 24),

$P(fx_i) = p$    (6)

The sequence of propositions derived from the propositional function f will be
called a propositional sequence, and is denoted by the symbol $(fx_i)$. The
sequence of the elements $x_i$ will be denoted by $(x_i)$. Probability appears here
as a property of a propositional sequence and is of the same logical type as
truth, which is a property of a proposition.

We can construct for (6) the frequency interpretation, which has a form
corresponding to (4, § 10). Since, however, we use the logical interpretation
of probability, we must count, not events, but sentences about events.
Therefore we write the frequency interpretation of (6) in the form

$P(fx_i) = \lim_{n \to \infty} \dfrac{1}{n} \, \overset{n}{\underset{i=1}{N}} \{V[fx_i] = 1\}$    (7)

The form (7) shows the close analogy between the metalinguistic and the
object-language interpretation of probability. For frequencies of events we
substitute, in the metalinguistic or logical interpretation, frequencies of
sentences about events. All theorems of the object interpretation can therefore
be transferred immediately to the logical interpretation; the two
interpretations are isomorphous. This is why they need not be distinguished by a
specific notation and quotation marks are dispensable.
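
For a finite initial section of a propositional sequence, the counting prescribed by the frequency interpretation can be sketched as follows. The function name and the sample propositional function (true whenever i is divisible by 3) are hypothetical; the limit itself cannot be computed from a finite section, so the code evaluates only the relative frequency among the first n propositions.

```python
from fractions import Fraction
from typing import Sequence


def relative_frequency(truth_values: Sequence[bool]) -> Fraction:
    """Count the true sentences among the first n and divide by n."""
    n = len(truth_values)
    if n == 0:
        raise ValueError("the sequence must contain at least one proposition")
    return Fraction(sum(truth_values), n)


# Hypothetical sequence: the proposition fx_i is true whenever i is divisible by 3.
prefix = [i % 3 == 0 for i in range(1, 301)]
freq = relative_frequency(prefix)
```

What is counted here are truth values of sentences, not events, which is the only difference from the object interpretation.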

It should be kept in mind that a propositional sequence is not the same as
a propositional function. To the propositional function f we must add the
ordered sequence of events $(x_i)$; it is this combination that makes up the
propositional sequence. We see that a propositional sequence shows a certain
analogy to propositions, since the latter can be regarded as a combination
of a propositional function and one argument $x_i$.

These explanations prepare the way for the construction of probability 
logic. We shall construct it as the logic of propositional sequences—as a logic 
the elements of which are propositional sequences as wholes. The extension 
of logic thus envisaged is based on the idea that the logic of propositional 
sequences can show more general features than the logic of the propositions 
constituting the elements of these sequences. The two-valued character of 
the elements need not be transferred to the compound expressions constructed 
from them.

We may compare the transition from alternative logic to probability logic
with the transition from Euclidean geometry to non-Euclidean geometry.
Although Euclidean geometry holds for small areas of a Riemann space, it
does not determine the structure of large areas, which, in general, is of a more
complicated type. Similarly, the alternative character holds for the “logic of
smallest domains”, namely, of propositions, whereas the continuous scale
holds for the “logic of large domains”, that is, of propositional sequences.

The comparison may be used for an analysis of the psychological factors
involved in the historical predominance of two-valued logic. The
corresponding predominance of Euclidean geometry is derived from the fact that
our environment satisfies the axioms of Euclid to a high degree of
approximation so far as relative positions of solid bodies are concerned. If this
environment were of a different structure (if it showed, in the dimensions of our
houses, the deviations asserted by Einstein for cosmic dimensions), mankind
would have developed from the outset a non-Euclidean conception of space.

Probability logic is in a similar position. In daily life we usually deal with 
phenomena that have a very high degree of probability. By putting these 
high probabilities = 1 it is possible to construct a sufficient idealization that 
knows only the two extreme points. For phenomena of a lower degree of 
probability, however, such an idealization is inadequate and must be replaced 
by the construction of a logic of a continuous truth scale. 

§ 78. The Probability of Finite Propositional Sequences 

For finite sequences $(x_i)$ the frequency definition of probability can be
maintained if we understand by the limit the value of $F^n(fx_i)$ for the last term.
In this case only the n + 1 values $\frac{0}{n}, \frac{1}{n}, \ldots, \frac{n}{n}$ are possible for the
frequency, and we are no longer dealing with a logic of a continuous scale, but
with a v-valued logic for v = n + 1.

Now it is clear that for finite sequences, in contradistinction to infinite 
sequences, the order of the elements has no influence upon the limit of the 
frequency. Consequently, for finite sequences, it is permissible to speak 
directly of the frequency. 

A further difference from infinite sequences consists in that for finite
sequences the limit of the frequency can be 1 only if $fx_i$ holds for all $x_i$.
Corresponding considerations hold for the limit 0. With respect to finite sequences,
therefore, we need not differentiate between general implication and a
probability implication of the degree 1, cases that must be distinguished for infinite
sequences (see §§ 12, 18).

By going to smaller and smaller values of n we can accomplish a transition
from a v-valued logic to the two-valued logic. It obtains for the case n = 1,
that is, for the case that the sequence is reduced to one element. The
expression $(fx_1)$ thus means the same as $fx_1$.

Because the value of the frequency for such sequences can only be 1 or 0,
we have here

$P(fx_1) = 1$  or  $P(fx_1) = 0$    (1)

In this case, therefore, the two possible probability values coincide with
truth and falsehood, and we have

$[P(fx_1) = 1] \equiv [V(fx_1) = 1]$
$[P(fx_1) = 0] \equiv [V(fx_1) = 0]$    (2)


Truth and falsehood can thus be regarded as the limiting cases of probability 
resulting when the sequence is reduced to one element. 
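
The attainable values for a finite sequence can be enumerated directly. The following sketch lists, for a given length n, every relative frequency that some assignment of truth values produces; the function name is illustrative.

```python
from fractions import Fraction
from itertools import product


def possible_frequencies(n: int) -> list:
    """All relative frequencies attainable by a propositional sequence of length n."""
    values = {Fraction(sum(seq), n) for seq in product([0, 1], repeat=n)}
    return sorted(values)


three_element = possible_frequencies(3)  # the n + 1 = 4 values 0, 1/3, 2/3, 1
one_element = possible_frequencies(1)    # only 0 and 1 remain
```

For n = 1 the list shrinks to the two values 0 and 1, which is the transition to two-valued logic described above.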


§ 79. Truth Tables of Probability Logic 

Returning now to the probability logic with a continuous scale, I shall show 
that in this logic truth tables comparable to the truth table 1 (p. 17) can be 
constructed for propositional operations. 

First, operations for propositional sequences analogous to those holding for
propositions must be defined. For this purpose the argument sequences $(x_i)$
and $(y_i)$ of the propositional sequences must be coordinated; corresponding
elements are indicated by the equality of the subscript. Second, the operation
combining two propositional sequences is defined in terms of the corresponding
operation combining the elements of the sequences:

$(fx_i) \lor (gy_i) =_{Df} (fx_i \lor gy_i)$
$(fx_i) \,.\, (gy_i) =_{Df} (fx_i \,.\, gy_i)$
$(fx_i) \supset (gy_i) =_{Df} (fx_i \supset gy_i)$    (1a)
$(fx_i) \equiv (gy_i) =_{Df} (fx_i \equiv gy_i)$

and correspondingly

$\overline{(fx_i)} =_{Df} (\overline{fx_i})$    (1b)

Because of these definitions we can now use the frequency interpretation,
in the form (7, § 77), for the probability of the propositional combinations.
We have, for instance,

$P((fx_i) \lor (gy_i)) = P(fx_i \lor gy_i) = \lim_{n \to \infty} \dfrac{1}{n} \, \overset{n}{\underset{i=1}{N}} \{V[fx_i \lor gy_i] = 1\}$    (2)

Similar relations hold for the other propositional operations. 




We must now introduce a new propositional operation that will allow us
to write degrees of relative probabilities. Since probabilities of this kind are
written in the form $P(fx_i, gy_i)$, the content of the parentheses in this
expression can be regarded as a compound proposition, resulting from the two
components by a propositional operation, denoted by the comma. We shall
call it operation of selection, or comma operation. We can also put the comma
between propositional sequences. We then define, by analogy with (1a),

$(fx_i), (gy_i) =_{Df} (fx_i, gy_i)$    (3)

The frequency interpretation of $P(fx_i, gy_i)$ is given by the expression

$P(fx_i, gy_i) = \lim_{n \to \infty} \dfrac{\overset{n}{\underset{i=1}{N}} \{V[fx_i \,.\, gy_i] = 1\}}{\overset{n}{\underset{i=1}{N}} \{V[fx_i] = 1\}}$    (4)

This form makes clear why we speak of the operation of selection. The
proposition $fx_i$ selects the subsequence in which we count the frequency of
$gy_i$. Since (3) allows us to regard the comma on the left side of (4) as standing
between the propositional sequences, we can also say that the comma
operation represents a selection from one sequence by another sequence.
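
For a finite section of two coordinated sequences, the quotient in (4) can be evaluated as in the sketch below. The function name and the sample truth values are hypothetical, and returning `None` for an empty selection is a choice made here to mark the undetermined case.

```python
from fractions import Fraction
from typing import Optional, Sequence


def selection_frequency(f: Sequence[bool], g: Sequence[bool]) -> Optional[Fraction]:
    """Relative frequency of g within the subsequence selected by f.

    Returns None when f is never true, since the quotient is then undetermined.
    """
    if len(f) != len(g):
        raise ValueError("the sequences must be coordinated element by element")
    selected = sum(f)
    if selected == 0:
        return None
    both = sum(1 for fi, gi in zip(f, g) if fi and gi)
    return Fraction(both, selected)


# Hypothetical truth values of fx_i and gy_i for i = 1 .. 6.
f = [True, True, False, True, False, True]
g = [True, False, True, True, False, False]
u = selection_frequency(f, g)
```

Only the four elements for which f is true enter the count; of these, two also make g true.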

Turning to the construction of the truth tables, we now meet an intrinsic
difference between probability logic and two-valued logic. The truth table 1
(p. 17) contains as arguments the two truth values of the elementary
propositions, so that in the adjunctive interpretation the two truth values determine
the truth value of the compound proposition. In probability logic, on the
contrary, two arguments are not sufficient; we need a third argument when
the probability value of the compound propositional sequence is to be
determined. This follows from the considerations leading to (7', § 21), which show
that for the determination of the probabilities of compound events three
fundamental probabilities are required. The three values $P(A,B)$, $P(A,C)$,
$P(A.B,C)$ were chosen for this purpose. In (9, § 24) the notation that omits
the reference class A was applied. Transcribing these results for the case of
probability logic, we regard the three probabilities

$P(fx_i) \qquad P(gy_i) \qquad P(fx_i, gy_i)$    (5)

as the arguments of the truth tables; the probabilities of other combinations
are determined by these values. According to the remarks added to (7', § 21),
we could replace the value $P(fx_i, gy_i)$ by that of any other of the propositional
combinations, for instance, by the value $P(fx_i \,.\, gy_i)$. It is a matter of
convenience that we prefer to use the three values (5) as the independent
parameters.

The construction of the truth tables is now easily achieved by the use of
formulas (7, § 13), (8, § 20), (3, § 14), (22, § 20), (23, § 20), and (6, § 21).
The results are compiled in table 9, A and B.

The table includes a restrictive condition, holding for the numerical values
of the three arguments. As pointed out in § 19, the three fundamental values
(5) are subject to numerical restrictions, which were formulated in (15, § 19).
This condition can be written in the form

$\dfrac{p + q - 1}{p} \leq u \leq \dfrac{q}{p}$    (6)

This inequality leaves the three values p, q, u independent of each other
within wide limits. For the case p = 1, however, we derive from (6) that
u = q. In this case, therefore, we have only two independent values. For
p = 0 there is no such mutual dependence. The second restrictive condition
must be added to the table because the value 1 of u for the identical functions
f and g cannot be derived from the table (see 4, § 82). This condition has a
function similar to that of axiom II, 1 (§ 12).
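
The effect of the restriction can be made concrete with a small computation. The sketch below evaluates the bounds $(p + q - 1)/p \leq u \leq q/p$ for p > 0; clipping the interval to the range from 0 to 1 is an added assumption (u is itself a probability), and the function name is illustrative.

```python
from fractions import Fraction


def u_bounds(p: Fraction, q: Fraction):
    """Interval of admissible u for given p > 0 and q.

    Assumption: the bounds are additionally clipped to [0, 1], because u is
    itself a probability.
    """
    if not (0 < p <= 1 and 0 <= q <= 1):
        raise ValueError("expected 0 < p <= 1 and 0 <= q <= 1")
    lower = max(Fraction(0), (p + q - 1) / p)
    upper = min(Fraction(1), q / p)
    return lower, upper


free = u_bounds(Fraction(1, 2), Fraction(1, 2))  # u may lie anywhere in [0, 1]
forced = u_bounds(Fraction(1), Fraction(3, 10))  # for p = 1 both bounds equal q
```

For p = q = 1/2 the third value is entirely free, whereas for p = 1 the interval collapses to the single point u = q.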

It can now be shown that probability logic is a generalization of two-valued
logic, or, more precisely speaking, that the truth table 1 (p. 17) of
two-valued logic is contained as a special case in the truth table 9 of probability
logic. This is proved as follows. Two-valued logic can be regarded as the
special case that the numerical values of the probabilities are restricted to
0 and 1. The first two columns in table 9B, on this restriction, can assume
only the four combinations 1,1; 1,0; 0,1; 0,0; the combinations are presented
in table 9C. Since for p = 1 we have u = q, the third column in table 9B
loses its independent character. This means that the third column of table 9C
becomes a function of the two preceding columns; the vertical double line
separating arguments from functions thus goes to the left by one column.
For the case p = 0 the indeterminacy in the value of u has been indicated
by a question mark. When we now insert, in the expressions of the other
columns, the respective values of p, q, and u, we obtain table 9C, the columns
of which refer to the respective headings of table 9B. Thus, putting p = 1,
q = 1, u = 1, we obtain for the disjunction the value 1 + 1 − 1 · 1, which
is = 1; and so forth. It turns out, and is easily verified, that thereby the
indeterminacy indicated by the question mark drops out for all the operations
used in two-valued logic. Thus the combination p = 0, q = 1, u = ?, gives
for the disjunction the result 0 + 1 − 0 · ?, which is = 1, since the question
mark means any finite number, and thus its product by 0 will give 0. Only
for the operation $(gy_i, fx_i)$ will the question mark reappear, namely, in the
cases $P(gy_i) = 0$. We thus obtain the analogue of table 9C.

When we compare table 9C with the truth table 1 of two-valued logic
(p. 17), we see that the tables are identical with respect to the operations
of disjunction, conjunction, implication, and equivalence. The values 1 and 0
of 9C correspond to the values T and F of table 1. It is clear that, in a similar
way, table 9A of the negation can be shown to be identical with table 1A
when the value of p is limited to 1 and 0.
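
The reduction can be verified mechanically. The sketch below assumes the usual formulas of the probability calculus for the compound values in terms of p, q, u (disjunction p + q − pu, conjunction pu, implication 1 − p + pu, equivalence 1 − p − q + 2pu); they are taken here to agree with the entries of table 9B, which is an assumption, since the table itself is not reproduced in this passage. The function names are illustrative.

```python
from itertools import product


def compound_values(p, q, u):
    """Probabilities of compound sequences from p = P(f), q = P(g), u = P(f, g).

    Assumption: these standard formulas match the entries of table 9B.
    """
    return {
        "or": p + q - p * u,
        "and": p * u,
        "implies": 1 - p + p * u,
        "equiv": 1 - p - q + 2 * p * u,
    }


def two_valued_row(p: int, q: int) -> dict:
    """Restrict p and q to 0 or 1; for p = 1 the restrictive condition forces u = q."""
    u = q if p == 1 else 0.5  # for p = 0 any finite u may be inserted
    return compound_values(p, q, u)


table = {(p, q): two_valued_row(p, q) for p, q in product([1, 0], repeat=2)}
```

In the rows with p = 0 the arbitrary value inserted for u is always multiplied by 0, so the resulting entries do not depend on it.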

That the tables of two-valued logic are contained in those of probability
logic as a special case need not surprise us. In fact, this is a consequence of
the previous result (§ 18), according to which the axioms of probability follow
from the frequency interpretation and are strictly satisfied even by finite
sequences. Referring to the remarks in § 78, we can, therefore, regard
two-valued logic as the probability logic holding for propositional sequences having
only one element.

The third column of table 9C can be regarded as the definition of the
individual operation of selection, holding between propositions. What distinguishes
this truth table from those of the other operations is the occurrence of a
question mark, that is, of cases where the truth value is not determined.
Truth tables of this kind may be called defective; and the individual operation
of selection, which holds between propositions, may be called a defective
operation. In this application of the table, the indeterminacy indicated by the
question mark must be interpreted, not as meaning that there exists no truth
value, but as meaning that all real numbers are assigned as truth values to
the expression. This follows from the corresponding property for empty
reference classes (see 7, § 12).¹ For the question-mark cases, the individual
operation of selection thus leads to a compound expression which is both true and
false, i.e., an expression that is not a proposition. For these reasons, the
compound established by the comma operation differs in kind from the other
operations and must not be substituted for elementary variables or used as
a term in a propositional operation. [For an exception see (31, § 82).] A
restricted rule of substitution will be introduced later (see 24, § 82).

This conception of the individual operation of selection will meet the
requirements of the frequency interpretation when the latter is introduced
in the general form

$P(hx_i) = \lim_{n \to \infty} \dfrac{\overset{n}{\underset{i=1}{N}} \{V[hx_i] = 1\}}{\overset{n}{\underset{i=1}{N}} \{V[hx_i] = 1 \lor V[hx_i] = 0\}}$    (7)

This expression will, in general, be identical with (7, § 77). But when the
functional $hx_i$ has the special form $[fx_i, gy_i]$, the denominator will be given
only by the cases in which this functional has one determinate truth value
and is thus either not true or not false. These are the cases where $fx_i$ is true.
Similarly, the numerator will count only the cases where $V[hx_i] = 1$, excluding
the question-mark cases. This is precisely the frequency interpretation given
in (4).

¹ For a sequence of infinite length in which p = 0, the value u need not be indeterminate,
although it is not determined by p and q. This is the case if the functional $fx_i$ is not always
false, although the limit of the frequency is = 0. In the subsequence selected by $fx_i$, the
relative frequency of $gy_i$ has then a definite value. The question mark in table 9B means,
therefore, that either the value of u is unknown or that u has all real numbers as its values.
The latter case results when $fx_i$ is false for all $x_i$ and is thus realized for a finite sequence
in which p = 0.

Several critics have raised the objection that the probability logic
presented here is not extensional. When “extensional” is taken to mean a logic
in which the truth value of a binary operation is determined by the individual
truth values of its two elements, the criticism is formally correct. But it is
materially misleading, since it uses a very narrow definition of the term
“extensional”. In § 4 it was explained that the term “extensional” should
rather be replaced by the term “adjunctive”. Only the latter term denotes
the properties that are relevant for a logic of this kind and which constitute
the background of the somewhat vague term “extensional”. A logic is
adjunctive if its truth tables can be read from left to right, in the direction from the
arguments to the function, and not only from right to left, or in the direction
from the function to the arguments. It is clear that the truth tables of
probability logic are adjunctive. If the term “extensional” is to have a reasonable
meaning it should be identified with the term “adjunctive”. Therefore it may
be said that probability logic is extensional in this wider sense.

The peculiar form of probability logic explains why this generalization of
two-valued logic was not found so long as logicians were seeking such a
generalization in terms of the two-valued truth tables, disregarding the calculus
of probability. Probability logic involves a generalization in which a function
of two arguments is replaced by a function of three arguments; two-valued
logic appears as the degenerate case in which the third argument becomes a
function of the other two. The degenerate case was erroneously assumed as
determining the form of every truth-functional logic. However, two
propositional components determine, not two, but three truth values: one for each
component and a third for the combination of the components. Of this kind
are the three arguments of probability logic: two are coordinated to the
individual propositional sequences and the third is coordinated to the pair as
a whole. The third probability value may be regarded as a measure of the
degree of coupling existing between the two sequences.

§ 80. Truth Tables of the Logic of Modalities 

Two-valued logic is contained in the truth tables as the special case n = 1.
Similarly, every other metrical v-valued logic is contained in the tables as
the special case n = v − 1. By metrical logic we understand a logic the truth
values of which can be interpreted as probabilities in the sense of the
frequency interpretation. Two-valued logic, also, is a metrical logic. But a brief
analysis shows that only in the latter logic are the truth values of the
operations fully determined by the two individual truth values of their components.
For v = 3, that is, n = 2, there are three possible truth values: 0, 1/2, 1. Here
the values of the operations are not determined for the row p = q = 1/2,
and the value of u must be regarded as an independent parameter for this
row. When v increases, the number of rows for which u must be independently
given becomes larger and larger.
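
The indeterminacy of the row p = q = 1/2 can be exhibited by enumeration. The sketch below lists all pairs of two-element sequences in which each sequence has the frequency 1/2 and collects the resulting values of the disjunction; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product


def frequency(seq) -> Fraction:
    """Relative frequency of true elements in a finite sequence."""
    return Fraction(sum(seq), len(seq))


half = Fraction(1, 2)
pairs = [
    (f, g)
    for f in product([0, 1], repeat=2)
    for g in product([0, 1], repeat=2)
    if frequency(f) == half and frequency(g) == half
]
# Probability of the disjunction for every such pair of two-element sequences.
disjunction_values = sorted(
    {frequency([fi | gi for fi, gi in zip(f, g)]) for f, g in pairs}
)
```

Two different values occur, 1/2 when the true elements coincide and 1 when they do not, so p and q alone do not fix the result.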

The construction of metrical logics of a finite number of truth values, other 
than two-valued logic, is of no particular interest, because we prefer to replace 
such logics by the logic of the continuous scale. We do so for the same reason 
that we replace finite sequences of observations by infinite sequences (see
§ 66). We arrive, however, at an important form of three-valued logic when
we turn to the modalities. This logic is not a metrical logic, as will be made 
clear presently. 

Like probabilities, the modalities must be regarded as properties not of
individual propositions but of propositional sequences. Only in a fictitious
sense can they be transferred to individual sentences, in analogy to the
concept of weight; but, for the present, such a conception will not be envisaged.
By the use of the abbreviation Nc for necessity, Ps for possibility, Im for
impossibility, and the symbol $M(f\hat{x}_i)$ for the phrase, the modality of $f\hat{x}_i$, the
modalities are defined as follows (for the circumflex notation see § 6):

$[M(f\hat{x}_i) = Nc] =_{Df} (x_i)f(x_i)$
$[M(f\hat{x}_i) = Ps] =_{Df} (\exists x_i)f(x_i) \,.\, (\exists x_i)\overline{f(x_i)}$    (1)
$[M(f\hat{x}_i) = Im] =_{Df} (x_i)\overline{f(x_i)}$

In this way of writing, the modality is regarded as a property of a situational 
function. In another conception the modality is considered to be a property 
of a propositional function. This conception results from (1) when the symbol 
M is interpreted as a metalinguistic symbol, like the symbol V. 

The term “possible” defined here has the meaning of “merely possible”,
that is, it excludes necessity. This notation is convenient for logical purposes.
For finite sequences, necessity and impossibility coincide, respectively, with
the probabilities 1 and 0, whereas possibility corresponds to probability values
between the two limiting cases. For infinite sequences, no such correspondence
exists. Here the limit of the frequency, and therefore the probability, can be
= 1 when the statement $(x_i)f(x_i)$ is not true; and similarly the probability
can be = 0 when the statement $(x_i)\overline{f(x_i)}$ is not true. The logic of modalities,
therefore, is not strictly identical with that of the three probability values
p = 0, 0 < p < 1, p = 1. But it turns out that the truth tables resulting for
the latter three cases are identical with those holding for the categories
defined by (1). These tables are given in table 10 (p. 406). They follow from table
9 (p. 400) when we construct the probability values of the operations on the
assumption that p and q, respectively, represent one of the foregoing three
cases. The tables can be established also by reference to the meanings of the
three categories defined in (1). Thus if $(fx_i)$ is always true and $(gy_i)$ is always
true, we see from the structure of the sequences that their disjunction must
be always true, and so forth, for the other operations.

Table 10 shows that in general the modalities of the compound
propositional sequence are determined by those of the elementary propositional
sequence, except for the middle row, where commas are used to separate the
possible values. Thus, if both $(fx_i)$ and $(gy_i)$ are possible, their disjunction
will be either possible or necessary, but it cannot be impossible. If the
ambiguity of the middle row is to be eliminated, we must go back to
probabilities. Then the value of the three parameters p, q, u will determine the value
of the disjunction.

The ambiguity of the middle row of table 10 has sometimes been regarded 
as a disadvantage. It is certainly possible to construct other forms of three¬ 
valued truth tables 1 containing no such ambiguity, but it seems impossible 
ever to construe the modalities in such a way. In any interpretation of the 
modalities, the ambiguity of the middle row is a necessary concomitant, since 
it corresponds to the usage of these concepts. If two events are possible, it is 
undetermined whether their conjunction is also possible. They might be 
mutually exclusive events each of which is possible. Corresponding consid¬ 
erations hold for the disjunction. If two events are possible, their disjunction 
can be either possible or necessary. Thus if we cast two dice simultaneously, 
it is possible that face 1 of the first die turns up, and it is possible that face 1 
of the second die turns up; here it is merely possible that one or the other of 
the faces turns up, as well as it is merely possible that both turn up. However, 
if we toss a coin it is possible that heads turn up, and it is possible that tails 
turn up; but it is necessary that heads or tails turn up, whereas it is impos¬ 
sible that both turn up. 
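
The dice and coin examples can be checked on finite sequences. The following Python sketch assumes, as a reading of the definitions (1), that a propositional sequence is necessary if all its elements are true, impossible if all are false, and merely possible otherwise. The function names and the enumeration of outcomes are illustrative choices, not part of the text.

```python
from itertools import product

def modality(truth_values):
    """Classify a finite propositional sequence (assumed reading of definitions (1))."""
    values = list(truth_values)
    if all(values):
        return "necessary"
    if not any(values):
        return "impossible"
    return "possible"

def compound_modalities(f, g):
    """Return the modalities of f, g, their disjunction, and their conjunction."""
    return (
        modality(f),
        modality(g),
        modality(x or y for x, y in zip(f, g)),
        modality(x and y for x, y in zip(f, g)),
    )

# Two dice: f = "face 1 on the first die", g = "face 1 on the second die".
dice = list(product(range(1, 7), repeat=2))
f_dice = [first == 1 for first, second in dice]
g_dice = [second == 1 for first, second in dice]

# One coin: f = "heads turn up", g = "tails turn up".
coin = ["heads", "tails"]
f_coin = [c == "heads" for c in coin]
g_coin = [c == "tails" for c in coin]

dice_result = compound_modalities(f_dice, g_dice)
coin_result = compound_modalities(f_coin, g_coin)
print(dice_result)  # ('possible', 'possible', 'possible', 'possible')
print(coin_result)  # ('possible', 'possible', 'necessary', 'impossible')
```

In both cases the two components are merely possible, yet the compounds receive different modalities; this is the ambiguity of the middle row.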

The definitions (1) can be applied to functional compounds of the form
[fx_i, gy_i], with the qualification that the cases where fx_i is false are canceled
for the operator expressions. In these cases, for which the truth table 9C
furnishes a question mark, the individual compound is not a proposition and
the operators thus cannot be applied. By the use of this rule a modality can
be determined for the function [fx_i, gy_i], except for the case that fx_i is empty.
This modality is given in the third column of table 10B; it shows an
indeterminacy in the middle row, like the other columns. In contradistinction to
table 9B, a knowledge of the modality of [fx_i, gy_i] would not enable us to
determine the modalities of the other compounds. This is shown by the

1 Of this kind are the tables constructed by Post and by Lukasiewicz and Tarski, which
the author used for the interpretation of quantum mechanics. See footnote, p. 387.

examples above. If heads turn up, it is impossible that tails turn up; thus for
the two faces of the coin we have the modality of impossibility for [fx_i, gy_i].
The same holds when we regard two faces of the same die. But in the first 
case the modality of the disjunctive compound event is necessity; in the second 
case it is possibility. 

This peculiarity of the logic of modalities is the reason that other attempts 
to construct a multivalued logic for the modalities did not succeed. Two-valued
logic is governed by the principle of adjunctivity, according to which
the truth value of the compound proposition is determined by the truth 
values of its components. The same principle holds for probability logic, in 
which, however, three truth values concerning the two components are re¬ 
quired for the determination. The assumption that the principle must hold 
also for the logic of modalities made it impossible to construct this logic. 
We saw that the modalities of the two components do not determine that of 
the compound expression; but we saw also that the introduction of a third 
argument would not remove this difficulty. It is therefore impossible to con¬ 
struct a table of modalities that can be read throughout from left to right; 
only the direction from right to left can always be used. The logic of modal¬ 
ities does not conform to the principle of adjunctivity; this logic can be 
understood only when it is viewed with probability logic as its background. 

The logic of modalities admits of a further application, which has no ana¬ 
logue in probability logic. The definition (1) of the modalities shows that we 
are concerned here with properties of sequences that, even in the case of an 
infinite number of elements, are independent of the order of the sequences. 
It follows that the definitions can be applied also when we choose as the 
sequence of elements the total range of objects for which the propositional 
function is meaningful, no matter in what order. (It is even irrelevant whether 
the range constitutes a denumerable class.) Since the range is determined by 
the propositional function alone, we can ascribe a modality, according to (1), 
to the propositional function itself without referring to a special sequence of 
arguments. Propositional functions of one variable can thus be conceived as 
three-valued expressions within the logic of modalities. 2 

The use of the modalities defined in (1) is subject to certain restrictions. 
The modalities do not fully correspond to the use of the respective categories 
in conversational language. Thus it may happen that no professor will ever 
enter a certain classroom of a university on Sundays; but we shall not speak 
here of impossibility. For the application of the modalities we demand not 
only that the all-statements used in the first and third of the definitions 
(1) be true, but that they have some further properties. In particular, we 

2 This conception, which was prepared by Russell, was used by Walter Dubislav, in Jour.
f. d. reine u. angew. Math., Vol. 161 (1929), p. 107, for the construction of three-valued truth
tables for propositional functions, with which the truth tables 10 (p. 406) are identical.
These tables are used in ESL, § 28.

may require that the statements be derivable from the general laws of physics.
A definition of modalities of this kind, which requires the metalinguistic
interpretation, is given elsewhere 3 under the name of nomological modalities. The
modalities defined by (1) are called extensional modalities. The general laws
of physics are defined at the same place under the term nomological formulas.
The nomological modalities of necessity and impossibility are subclasses of
the corresponding extensional modalities; the extensional modality of
possibility is a subclass of nomological possibility. Table 10 (p. 406) applies also
to nomological modalities and thus represents the general form of the logic
of modalities.

The difference between physical and logical modalities, explained with 
reference to possibility at the end of § 66, cannot be formulated in terms of 
extensional modalities. It requires nomological modalities because it is based
on the distinction between synthetic and analytic nomological formulas.

§ 81. The Logic of Weight 

Probability logic has been constructed in § 79 as a logic of propositional
sequences. The probability logic of individual propositions, or logic of weight,
is constructed by a fictitious transfer of the truth properties of propositional
sequences to individual propositions. We write

P(a) = p (1)

thus admitting individual propositions inside the probability functor. The 
number p measures the weight of the individual proposition a. It is under¬ 
stood that the weight of the proposition was determined by means of a suit¬ 
able reference class, but once the determination is completed, the indication 
of this class will be omitted and the notation (1) will be used. 

We can also construct compound propositions and determine their weights. 
Thus we can write 

P(a ∨ b) = r (2)

The weights of compound propositions are determined by truth tables
resulting from table 9, A and B (p. 400), when we replace the expressions fx_i and
gy_i by the symbols a and b. The tables of the logic of weight are presented in
table 11, A and B. Like the logic of probability, this logic requires three 
arguments for the determination of the weight of compound propositions; 
that is, the truth tables have three argument columns. The value 

P(a,b) = u (3)

is employed as the third argument value.

3 ESL, §§ 23, 05.

Using the implicative notation, we would write (3) in the form

a ∋_u b (4)

Thus we have an individual probability implication. This concept is as
fictitious as that of weight; it results from a transfer of meaning from the general
to the particular case. The fictitious nature finds its expression in the way of
writing the general probability implication:

(i) (fx_i ∋_p gy_i) (5)

Here the degree p of probability, existing for the whole sequence, is regarded
as holding for every individual case of the sequence, independently of the
truth values of fx_i and gy_i. This conception is expressed by the use of the
all-operator in front of the expression (5). It is clear that, when we regard the
expression resulting from cancellation of the all-operator as meaningful, the 
conception of an individual probability implication has only a fictitious 
meaning. 

It is different with the P-notation, when the latter is interpreted by
sequences of two-valued propositions. When we write, instead of (5),

P(fx_i, gy_i) = p (6)

the expression, in this interpretation, must not be regarded as an all-statement 
the individual cases of which have the form (3). The individual statements of 
this propositional sequence have the form 

a,b (7) 

and must be regarded as propositions that are either true or false, depending
on the truth values of a and b, according to the truth tables of the operation
of selection given in table 11C. Exception is made for the question-mark case,
where the compound (7) is no proposition. Going from this individual opera¬ 
tion to statement (6), we cannot use an all-operator, but must use a method 
of counting the individual truth values, formulated in (4 and 7, § 79). We 
thus can regard the two-valued operation of selection as a substitute for an 
individual probability implication, 1 which has the advantage that it is not 
of a fictitious nature. Like other propositional operations, this operation is 
given a probability in the logic of weight, written in the fictitious form (3). 

In every logic we employ, besides the terms true and false, the term assertion.
Whereas “truth” and “falsehood” are semantical terms, “assertion” is a prag¬ 
matic term, since it refers to the sign-user. In two-valued logic there is a simple 

1 In the German edition of this book (1935), p. 379, I identified the two operations, a
method that I now regard as incorrect. See also my article, “Ueber die semantische und die
Objektauffassung von Wahrscheinlichkeitsausdrücken,” Jour. of Unified Science
(Erkenntnis), Vol. VIII (1939), pp. 61-62.

relation between these categories: a true statement can be asserted. In
probability logic, however, we must, in general, use statements that are not known
to be true. It is here that the term posit applies, which is a pragmatic term 
of the nature of the term assertion: we assert the best posit. We see that 
assertability is achieved, in probability logic, by the use of posits. 

Even when we restrict assertability to statements of the probability 1,
as we shall do in §§ 82-83, we cannot dispense with posits. If the limit
of the frequency is = 1, there may yet be false statements in the propositional
sequence. When we deal with each of the statements as true, we can do so 
only in the sense of a posit. 

Although positing is a behavior of the sign-user and thus a pragmatic affair, 
the phrase “best posit” is a semantical term. It is defined by reference to the 
truth of statements, namely, as the statement that will be true in the greatest 
number of cases. In probability logic, therefore, it occupies a place similar 
to that of the term “true statement” in two-valued logic. By the act of posit¬ 
ing, we select a certain class of statements as a basis for action. From the 
plurality of statements of various degrees of weight, we single out a unique 
group that serves the same purpose as the group of asserted statements of 
two-valued logic. 

Since the structure of probability logic is complicated, we often prefer 
methods that permit the use of two-valued logic instead of probability logic, 
thus equating truth and assertability. For a transition from probability logic 
to two-valued logic the following methods can be used. 

1. Method of division. By a procedure similar to the first method described
in § 76, we divide the scale of probability in two domains, the first running
from 0 to 1/2 and the second from 1/2 to 1. We regard statements of the first
domain as false, those of the second domain as true. More often we use a
trichotomy instead of a dichotomy: we choose within the scale of probability
the two values p_1 and p_2 so that p_1 is close to 0 and p_2 close to 1, and then
call statements of the domain 0 to p_1 false, those of the domain p_2 to 1 true.
Statements of the domain between p_1 and p_2 are omitted, that is, they are
regarded as statements of unknown truth value.

It is obvious that this method represents but another form of positing, 
since the statements defined as true can be maintained only in the sense of 
posits. The method has the advantage that the statements defined as false 
can be transformed into assertable statements by the use of the negation. 
However, the two-valued logic so introduced has the character of an approxi¬ 
mation, a result that is visible in certain peculiar consequences of the method 
of division. Assume that the weights p and q of two statements are situated
in the domain from p_2 to 1; then the two statements will be called true. If,
furthermore, the two statements are independent of each other, their
conjunction will have the weight pq. Now if p and q are not much larger than p_2,
it may happen that pq < p_2; the conjunction of the two true statements will
then not be true; and if pq is smaller than p_1, the conjunction may even
be false.
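
The consequence just described can be made concrete with a short Python sketch. The thresholds 0.1 and 0.9 and the weights 0.92, 0.93, and 0.95 are illustrative assumptions, not values taken from the text; the function `classify` is a hypothetical name for the trichotomy of the method of division.

```python
def classify(weight, p1=0.1, p2=0.9):
    """Trichotomy of the method of division; the thresholds p1, p2 are illustrative."""
    if weight <= p1:
        return "false"
    if weight >= p2:
        return "true"
    return "unknown"

# Two independent statements whose weights lie just above p2.
p, q = 0.92, 0.93
conjunction = p * q  # weight of the conjunction under independence

labels = (classify(p), classify(q), classify(conjunction))
print(labels)  # ('true', 'true', 'unknown')

# With many independent statements of weight 0.95 each, the conjunction
# eventually falls into the domain called false.
witnesses = next(n for n in range(1, 1000) if classify(0.95 ** n) == "false")
print(witnesses)  # 45
```

Each statement taken singly is called true, while the conjunction of sufficiently many of them is called false; the two-valued classification is thus only an approximation to the underlying weights.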

Cases of this kind may actually happen. Thus, if a judge is considering 
the testimony of an individual witness whose reliability has not been ques¬ 
tioned, he will regard the testimony as true; but he may hesitate to regard 
the statements of all witnesses as true. A way out of the situation, which is 
contradictory for two-valued logic, is given by returning to the probability 
nature of the weights assigned to the testimonies. Similar paradoxes will arise 
for a disjunction; if two or more statements are each false in the sense defined 
by the division, their disjunction may be true, or at least not false. In spite 
of these paradoxes, the method of division is widely used. 

2. Method of transition to the metalanguage. By a method similar to the
second method described in § 76, we can go from a statement a concerning
an event to the metalinguistic statement P(a) = p, according to which the
probability of the first statement is = p. This method is often used. It is
necessary, in particular, when we must consider the weight of statements 
that cannot be posited because their weight is too low, or when the weight 
of a posit becomes relevant. The whole calculus of probability must be incor¬ 
porated in this method when probabilities are regarded as properties of indi¬ 
vidual statements, not of events, since the assertions made in this calculus 
concern probabilities. 

It should be noticed, however, that this method requires the use of the first
method so far as probability statements, that is, statements of the form P(a) = p,
can never be proved to be strictly true. Such statements are maintained as
posits of the second level (see § 89). When we regard such statements as true, 
we use the method of division. 

3. Method of reduction to two-valued elements. This method is identical with 
the frequency interpretation. We substitute here propositional sequences for 
the individual propositions within the P-symbol, and regard a probability 
as a frequency of two-valued statements. The ultimate elements of this logic, 
therefore, are two-valued propositions. Using this method, however, we have 
abandoned the logic of weights, and so have omitted a probability evaluation 
of the single case. We have thus returned to the logic of propositional se¬ 
quences that was developed in § 79. The use of this method, therefore, is 
restricted to cases in which only large numbers of objects are considered, as 
in all statistical investigations. 

This method, like the second, requires a simultaneous use of the method 
of division. When we regard the individual statements of statistics as true, 
we can do so only in the sense of an approximation, since all that can be 
asserted is a high weight; the same holds for the statement about statistical 
frequencies when the statement is meant to include future observations.

In contradistinction to the second method, the third method is capable of
two linguistic interpretations, as was explained in § 73. We can regard it, 
like the second method, as a transition to the metalanguage; then the expres¬ 
sion (6) is meant to have the form (5, § 77), and thus refers to propositional 
sequences. But we can also regard (6) as referring to sequences of events. 
The object language is two-valued in both. In the first, it is so because the 
ultimate elements of the object language are two-valued propositions the fre¬ 
quency relations of which are expressed in (6). In the second, it is two-valued 
because it states frequency relations holding for events in a two-valued 
language. 

§ 82. Derivations and Tautologies in Probability Logic

The method of derivation developed for the calculus of probability can be 
transferred to probability logic. Since the method deals with formulas con¬ 
taining the P-symbol, however, it represents a method of derivation, not 
within the object language of probability logic, but within the metalanguage 
corresponding to it, and may be called an indirect method of derivation. Its 
advantage consists in the fact that, since the metalanguage of probability 
logic is two-valued, the procedure of derivation is identical with the derivative 
methods of two-valued logic. The presentation of a direct method of deriva¬ 
tion will be postponed to a later section (§ 83). 

For the analysis of the indirect method, the derivable formulas can be 
divided into two categories. The first includes formulas stating that the 
probability of a certain expression is = 1, for example, 

P(a ∨ ā) = 1 (1)

In the second category are formulas expressing relations between probability
values, for example,

P(b) = P(a) · P(a,b) + [1 − P(a)] · P(ā,b) (2)

which corresponds to the rule of elimination (2, § 19), resulting from it by 
the omission of the first term A and a renaming of variables. 
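
In the frequency interpretation, relation (2) can be verified on any finite sequence of two-valued propositions. The Python sketch below uses a hypothetical sequence of truth-value pairs for a and b; the helper names `freq` and `cond_freq` are assumptions introduced for illustration only.

```python
from fractions import Fraction

# Hypothetical finite sequence of truth-value pairs (a, b).
pairs = [(True, True), (True, False), (False, True), (True, True),
         (False, False), (False, True), (True, True), (False, False)]

def freq(values):
    """Relative frequency of true elements in a finite sequence."""
    values = list(values)
    return Fraction(sum(values), len(values))

def cond_freq(pairs, given, then):
    """Frequency of `then` among the elements satisfying `given`.

    Raises ZeroDivisionError if no element satisfies `given`; this corresponds
    to the case in which the first term is empty.
    """
    selected = [then(a, b) for a, b in pairs if given(a, b)]
    return Fraction(sum(selected), len(selected))

P_a = freq(a for a, b in pairs)
P_b = freq(b for a, b in pairs)
P_a_b = cond_freq(pairs, lambda a, b: a, lambda a, b: b)             # P(a,b)
P_nota_b = cond_freq(pairs, lambda a, b: not a, lambda a, b: b)      # P(ā,b)

rhs = P_a * P_a_b + (1 - P_a) * P_nota_b
print(P_b, rhs)  # 5/8 5/8
```

Exact fractions are used so that the equality of the two sides is not obscured by rounding.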

Although the derivation of the formulas is carried through in the two¬ 
valued metalanguage, there is a certain difficulty connected with it. As long 
as formulas like (1) and (2) are derived in the calculus of probability, not 
only the language of the derivation but the language of the symbols inside 
the parentheses, that is, the object language, is two-valued. This permits 
the use of logical transformations inside the parentheses; for instance, in the 
treatment of the expression P(a.b ∨ ā.b) we are allowed, applying the rule
of replacement, to write the expression in the form P(b). This replacement
is used in the derivation of (2). When, however, the derivation refers to
probability logic, the expressions inside the parentheses are subject to the
rules of probability logic; the signs of the propositional operations are then 
defined by the truth tables of probability logic, and we are not allowed, 
without a specific proof, to apply logical transformations to the expressions. 
Derivations in the metalanguage of probability logic, therefore, although 
performed in a two-valued language, are not legitimate until a proof of the 
applicability of logical transformations inside the parentheses is given. 

Turning to the construction of this proof, we shall show that the object 
language of probability logic admits of the same logical transformations that 
are used in two-valued logic. For this purpose the problem of tautologies must 
be studied first. In probability logic, tautologies are defined as formulas that 
have the probability 1 for all probability values of their constituents. When 
we set up the rule that formulas of the probability 1, and only such formulas, 
can be asserted, tautologies will constitute the assertable formulas of prob¬ 
ability logic. 

We shall first consider only derivations of formulas of type (1). It will be 
shown that every tautology of two-valued logic is, at the same time, a tau¬ 
tology of probability logic. Such formulas will be referred to as tautologies of 
the first kind, since, as we shall see later (§ 83), probability logic includes a 
second kind of tautologies that have no analogues in two-valued logic; such 
tautologies are constructed from formulas of type (2). It will be shown also 
that tautologies of the first kind may be used for transformations of the 
expressions inside the P-symbol. This result guarantees that for every formula 
that is derivable in the calculus of probability there exists an analogue in the 
metalanguage of probability logic, including formulas of type (2). 

To illustrate the problem, consider the formula 

P(a.a ≡ a) = 1 (3)

When we regard the equivalence sign and the period sign as defined by the 
two-valued truth tables, the formula is demonstrable in the calculus of prob¬ 
ability. For this purpose we would have only to rewrite the axioms of the 
calculus by the omission of the first term A (which means, in the frequency 
interpretation, that we restrict our consideration to sequences that are
compact in A). Since the probability of every compound expression can be written
as a function f of the probabilities of the elementary expressions, and since
the probability of every two-valued tautology is = 1, we shall find that f is
identical with 1 for all probability values of the elementary expressions.

This inference cannot be applied immediately to (3) when the logical signs 
are regarded as expressing operations of probability logic. Instead, (3) must 
be proved independently; for this proof it cannot be assumed that a.a can
be replaced by a, since such replacement has not yet been shown to be
permissible for the operations of probability logic. However, we can give for (3)
an independent proof, which shows, at the same time, why the formula
P(a,a) = 1 must be added to the truth tables of probability logic:

P(a.a) = P(a) · P(a,a) (col. 5, table 11B)

P(a,a) = 1 (restrictive condition 2, table 11)

P(a.a) = P(a) (4)

Looking at column 7 of table 11 B (p. 408), we see that (4) is not sufficient to 
establish (3); we must first prove that P(a.a,a) = 1, since P(a.a,a) cor¬ 
responds to the value u of the column. The proof is given as follows: 

P(a,a) = 1

P(a.b,a.b) = 1 (substitution of a.b for a) (5)

P(a.b,a.b) = P(a.b,a) · P(a.b.a,b) (col. 5, table 11B)

Since the latter product contains two factors that cannot be larger than 1,
and since the product is = 1 because of the preceding line, each of the two
factors must be = 1, that is, we have

P(a.b,a) = 1 (6)

P(a.b.a,b) = 1 (7)

Substituting a for b in (6), we arrive at 

P(a.a,a) = 1 (8) 

(8) in combination with (4) proves (3). 

By similar inferences it is possible to prove the commutativity and the 
associativity of the “and” and the “or”, but these proofs will not be given 
here. 

However, we shall prove the rule of replacement (see 3, § 5), which for 
probability logic assumes the form: if the formula 

P(b ≡ c) = 1 (9)

is derivable, that is, a tautology, it is permissible to replace b by c in all 
probability expressions. The result enables us to use the tautological equiva¬ 
lences of probability logic for transformations inside the P-symbol. 

We first prove that if (9) holds we have 

P(b) = P(c) (10) 

Employing the notation 


P(b) = p    P(c) = q    P(b,c) = u    P(c,b) = v (11)

we obtain, from the seventh column of table 11B,

1 − p − q + 2pu = 1 (12)

u = (p + q)/(2p) (13)

The second inequality of restrictive condition 1 of table 11 gives the result

u ≤ q/p (14)

which, by the substitution of (13), leads to the inequality

p ≤ q (15)

Since the values of all probabilities are subject to the condition of
normalization, that is, must lie between 0 and 1, limits included, we derive from
u ≤ 1, by means of the last column of table 11B, that

u = vq/p ≤ 1    v ≤ p/q (16)

Inserting the value u = vq/p in (12), we derive

1 − p − q + 2qv = 1 (17)

v = (p + q)/(2q) (18)

Combining (16) and (18), we arrive at

q ≤ p (19)

(15) and (19) lead to the conclusion that

p = q (20)

a result that proves (10). Although the quantity p occurs in the denominator 
of (13), the relation (20) holds also for the case p = 0. Namely, if p = 0, 
we derive from (13) that also q = 0, since otherwise u could not be ≤ 1.
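
The argument leading to (20) can be cross-checked numerically. The sketch below assumes that the probability of the equivalence b ≡ c is 1 − p − q + 2pu, as on the left side of (12), and that u is bounded by u ≤ 1 and u ≤ q/p, as in (14). Under these assumptions the largest attainable value of the equivalence probability is 1 − |p − q|, which equals 1 only for p = q. The function names and the grid of test values are illustrative.

```python
from fractions import Fraction

def equivalence_probability(p, q, u):
    """P(b ≡ c) in the form 1 - p - q + 2pu, with p = P(b), q = P(c), u = P(b,c)."""
    return 1 - p - q + 2 * p * u

def largest_equivalence_probability(p, q):
    """Maximize over u subject to u <= 1 and u <= q/p (requires p > 0)."""
    u_max = min(Fraction(1), q / p)
    return equivalence_probability(p, q, u_max)

grid = [Fraction(k, 10) for k in range(1, 11)]
results = {(p, q): largest_equivalence_probability(p, q) for p in grid for q in grid}

# The maximum equals 1 - |p - q| on the whole grid.
matches_bound = all(value == 1 - abs(p - q) for (p, q), value in results.items())

# The value 1 is reached only on the diagonal p = q.
reaches_one = sorted(pair for pair, value in results.items() if value == 1)
print(matches_bound, all(p == q for p, q in reaches_one))  # True True
```

The check covers only p > 0, in agreement with the remark that the case p = 0 requires a separate consideration.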

We now prove that if (9) holds we have 

P(b,c) = 1 (21a)    P(c,b) = 1 (21b)

This results from (13) and (18) with p = q. The proof, however, depends on
the condition p > 0.

Furthermore, we prove that if (9) holds we have

P(a,b) = P(a,c) (22)

We first prove that if P(d) = 1 we have also P(a,d) = 1 for every P(a) > 0.
With the notation

P(d) = p    P(a) = q    P(d,a) = u    P(a,d) = v (23)

we derive from the restrictive condition 1 of table 11 that for P(d) = 1 we
have P(d,a) = P(a). The last column of table 11B then furnishes P(a,d) = 1.
Applying the result to (9), we derive

P(a, b ≡ c) = 1 (24)

In order to apply the truth table 11 to expressions like (24), in which the 
propositional operation stands in the second place of a comma expression, 
we introduce the following addition to the rule of substitution: 

Rule α. If a relation between probabilities P(x_1), P(x_2), . . . , is
derivable, it is permissible to substitute a,x_i for every x_i, provided that the same
variable a is used in all expressions. Expressions of the form a,(x,y) resulting
from such substitutions are replaced by the form a.x,y.

Applying this rule, and using the notation 


P(a,b) = p    P(a,c) = q    P(a.b,c) = u    P(a.c,b) = v (25)

we can now write (24) in the form (12), since we can insert the first term a in
column 7 of table 11B. Applying to (12) the same methods as above, we
derive (22). 

By means of the last column of table 11B we now derive from (22) that if
(9) holds we have

P(b,a) = P(c,a) (26)


This proof, however, is bound to the conditions P(b) > 0 and P(c) > 0. 

Employing the results obtained, it is now easy to show that if (9) holds
we have

P(d.b ≡ d.c) = 1 (27)

P(d ∨ b ≡ d ∨ c) = 1 (28)


The proof need not be given here. In a similar way, all further conditions for 
the rule of replacement are derivable. 1 

The proofs so far have been restricted to the condition that probabilities 
of expressions occurring in the first term, like P(b) in (21a), are > 0. It will 
now be shown that the results can be made independent of this condition. 
The proof is given by showing that when we apply the frequency interpre¬ 
tation the condition can be dropped. In the formal system of probability 
logic, then, we shall introduce a specific rule stating that the condition can 


1 ESL, p. 60.

be dropped; the proof guarantees that the formal system so extended still
admits of a frequency interpretation. 

When we apply the frequency interpretation to finite sequences, a
probability value P(A) = 0 can occur only if the class A is empty. (We write
capitals because we are now dealing with classes.) For this case we have set
up the rule that P(A,B) is not univocal, but has all real numbers as its value.
A relation derived for P(A) > 0 will, therefore, hold also for this case, since
the value of P(A,B) determined by the relation will then certainly be among
the values of P(A,B). Only for infinite sequences must we make an exception,
since for them P(A) can be = 0, although the class A is not empty (only the
limit of the frequency need be = 0). For a class A of this kind, for instance,
P(A,D) need not be = 1 when P(D) = 1.
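
A class of this kind can be illustrated by a simple construction, which is an assumption chosen for this sketch and not an example given in the text: let A hold for the element x_i when i is a perfect square, and let D hold when i is not a perfect square. The class A is not empty, yet its frequency tends to 0; the frequency of D tends to 1, while D is false for every element of A.

```python
from fractions import Fraction
from math import isqrt

def is_square(i):
    """True if i is a perfect square."""
    return isqrt(i) ** 2 == i

def frequencies(n):
    """Frequencies of A, of D, and of D within A for the initial section of length n."""
    indices = range(1, n + 1)
    in_A = [i for i in indices if is_square(i)]       # A: i is a perfect square
    in_D = [i for i in indices if not is_square(i)]   # D: i is not a perfect square
    freq_A = Fraction(len(in_A), n)
    freq_D = Fraction(len(in_D), n)
    # Frequency of D among the elements of A; in_A is never empty for n >= 1.
    freq_D_given_A = Fraction(sum(1 for i in in_A if not is_square(i)), len(in_A))
    return freq_A, freq_D, freq_D_given_A

for n in (100, 10_000):
    print(n, frequencies(n))
```

For growing n the first value approaches 0 and the second approaches 1, whereas the third remains 0; in the limit we thus have P(A) = 0, P(D) = 1, and P(A,D) = 0.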

It is not necessary, however, to consider such cases. All formulas that are 
derivable must hold for both finite and infinite sequences; this follows because 
the same holds for the formulas given by the truth tables of probability logic. 
In particular, if a formula P(D) = 1 is derivable, it holds strictly for every 
n of a sequence, since the formula holds for all probability values of the
constituents of D and thus is true whatever be these values for the n considered.
The formulas consist in the statement of an equality of certain frequencies.
Now if the equality holds for every initial section of the length n of the
sequence (and this follows from its validity for finite sequences), it must
hold also for the limit of the frequency for n → ∞. It is therefore impossible
to derive a formula P(A,B) = . . . that is false for infinite sequences when 
P(A) = 0. In order to express in the formal system of probability logic this 
result derived from the frequency interpretation, we introduce the rule, re¬ 
turning to small letters: 

Rule β. If a formula is derivable for P(a) > 0, it holds also for P(a) = 0.

As a consequence of this rule, the relations (10), (21a), (21 b), (22), (24), 
(26), (27), (28) are valid for all probability values, on the condition that (9) 
is a derivable formula (and not only a true formula). 

Using the rule of replacement, we can now derive all tautologies of the first
kind of probability logic, tautologies that are identical with tautologies of 
two-valued logic. They are derived in the metalanguage, in the form of a 
statement that the probability of the formula is = 1. In the process of deriva¬ 
tion we can use the tautologies already derived for logical transformations. 
It is easily seen that in this way every tautology of two-valued logic is made 
a tautology of probability logic. 2 

2 For the treatment of compound expressions like P(a⊃b, a∨b) the following method is
used. The truth tables allow us to divide the expression after the comma, that is, to write
the probability considered as a function of P(a⊃b,a), P(a⊃b,b) and P([a⊃b].a,b). The last
column of table 11B then permits us to determine these probabilities as functions of the
reversed ones. We thus arrive at probabilities with a single term before the comma and a
compound term after the comma. The latter term is then divided by the method used before.
By repetition of these methods the compound probability is broken down to probabilities
containing only single terms before and after the comma.

It should be realized, however, that, although every tautology of two-valued
logic supplies probability logic with a tautology of identical form, the
meaning of the two tautologies is not the same. The corresponding signs of 
propositional operations have different meanings, since they are defined by 
different truth tables. Similarly, the fact that the truth values of proposi¬ 
tional variables have different ranges leads to a difference in meaning. The 
meaning of a logical formula cannot be separated from the semantical rules 
holding for the logic to which the formula belongs; though these rules are 
statable only in the metalanguage, they control the meaning of the object 
language formula, being implicitly included in its symbolic form. The
relation between probability logic and two-valued logic, therefore, must be stated
as an isomorphism rather than as an identity of parts; the tautologies of
two-valued logic are isomorphous to a subclass of the tautologies of probability
logic. Only when the range of probability values is restricted to the values
0 and 1 will the tautologies of this subclass assume the meaning of the
corresponding two-valued tautologies. In other words, two-valued logic is a
special case of probability logic resulting for a certain restriction of its
semantical rules.

The derivation of formulas of type (2) can now be carried through easily.
In this derivation we can use the tautologies of the first kind for logical
transformations inside the parentheses; furthermore, we shall use rules α
and β. It is obvious that in this way every formula of the calculus of
probability is derivable in identical form in the metalanguage of probability logic.

So far, the metalanguage has been used for both the derivation and the
assertion of tautologies. It is easily seen that the assertion, at least of tau¬ 
tologies of the first kind, can be transferred to the object language if the rule 
is introduced: 

Rule of assertion. A formula of the probability 1 may be asserted, 
except for the case that it has the probability 0 simultaneously. 

The probability 1 thus takes over the function of the truth of two-valued 
logic, although the first concept is slightly wider than the second. The rule 
permits us to assert all tautologies of two-valued logic unchanged within the 
frame of probability logic. For instance, we may write the tautologies 

$a \lor \bar{a}$

a V b D a (29) 

The way of writing does not indicate the probability frame, which is
understood.

A tautology of the first kind that does not occur in two-valued logic is the
formula

$a, a$    (30)

which holds because of the second restrictive condition of table 11 (p. 408).
Exception is to be made for the case $P(a) = 0$, for which (30) is not
assertable because of the indeterminacy arising for the comma operation. An
assertable formula containing a comma may be connected with another
assertable formula by means of an “and”. For instance, the formula

$[a \lor \bar{a}] \,.\, [a, a]$    (31)

is a meaningful expression for $P(a) > 0$ and has the probability 1.

§ 83. The Quantitative Negation 

The assertion of tautologies of the second kind in the object language of 
probability logic requires some further technical means, which at the same 
time make it possible to carry through the process of derivation in the object 
language. In order to develop these means, a way must be found to express 
the degree of probability, or the weight, within the object language. This 
aim can, in fact, be reached. For this purpose a specific instrument, namely,
a quantitative negation, will be constructed.

In two-valued logic the negation serves as an instrument to express truth
values in the object language. Instead of saying, in the metalanguage, the
sentence a is false, we say, in the object language, $\bar{a}$. By means of the
negation we thus coordinate a true sentence to a given false sentence, and, asserting
the true sentence, we express the fact that the original sentence is false.

A similar method is used in multivalued logics, which for such purposes
possess a cyclical negation. For example, in three-valued logic¹ there are three
truth values $t_1$, $t_2$, $t_3$, the “highest” of which, say $t_3$, may be
regarded as truth. The cyclical negation, written as $\sim a$, is defined by the
truth table 12. As in two-valued logic, only statements having the highest truth
value are assertable. When we wish to say that a statement a has the truth value
$t_1$, usually interpreted as falsehood, we write

$\sim a$    (1)

and when we wish to say that a has the middle truth value $t_2$, we write

$\sim \sim a$    (2)

Thus the rank of the truth value is identical with the number of negation
signs placed before the proposition. The truth value $t_3$ is given by

$\sim \sim \sim a$    (3)

This, however, is the same as a; the three negation signs in (3) thus can be
canceled.
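
The bookkeeping of negation signs in (1) to (3) can be traced in a few lines of Python. This is only an illustrative sketch, not part of the original exposition: the truth values are encoded as the ranks 1, 2, 3, and the mapping $t_1 \to t_3$, $t_2 \to t_1$, $t_3 \to t_2$ is assumed here because it is the one under which n negation signs are assertable exactly for a statement of rank n. The names `negate` and `assertable` are hypothetical.

```python
# Sketch: three-valued cyclical negation on ranks 1, 2, 3 (rank 3 = truth).
# The mapping below is an assumption chosen to agree with formulas (1)-(3).
CYCLIC_NEGATION = {1: 3, 2: 1, 3: 2}
HIGHEST = 3


def negate(rank: int, times: int = 1) -> int:
    """Apply the cyclical negation `times` times to a truth-value rank."""
    for _ in range(times):
        rank = CYCLIC_NEGATION[rank]
    return rank


def assertable(rank: int) -> bool:
    """Only statements with the highest truth value may be asserted."""
    return rank == HIGHEST


for rank in (1, 2, 3):
    # Smallest number of negation signs that yields an assertable statement.
    signs = next(n for n in (1, 2, 3) if assertable(negate(rank, n)))
    print(f"t{rank}: assertable with {signs} negation sign(s)")
```

Three applications of `negate` return every rank to itself, which is the cancellation of the three negation signs noted for (3).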

¹ See, for instance, the presentation of three-valued logic in H. Reichenbach, Philosophic
Foundations of Quantum Mechanics (Berkeley, 1944), § 32.
TABLE 12

Truth Table of Three-Valued Cyclical Negation

    $a$        $\sim a$
    $t_1$      $t_3$
    $t_2$      $t_1$
    $t_3$      $t_2$

This procedure can be used for any n-valued logic. Probability logic,
however, has an infinite number of truth values arranged in a continuous scale,
and the cyclical negation must therefore be constructed as a quantitative
negation. We write this negation by the addition of a numerical variable
$\lceil w \rceil$ in half-brackets before the sentence, and define it by the truth table 13.


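TABLE 13

Truth Table of Quantitative Negation in Probability Logic

    $a$      $\lceil w \rceil a$
    $p$      $p - w + \delta_{p-w}$

$$\delta_{p-w} = \begin{cases} +1 & \text{for } p - w \le 0 \\ \phantom{+}0 & \text{for } 0 < p - w < 1 \\ -1 & \text{for } p - w = 1 \end{cases}$$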

The negation variable w is a real number between 0 and 1, limits included. 

Table 13 may be illustrated by figure 27. The range of probability from 0 to 1
is indicated by the circle, running clockwise. The sentence a has a probability p,
indicated by its position on the circle. The negation w runs counterclockwise.
For $w < p$ the sentence $\lceil w \rceil a$ has a smaller probability. For
$w = p$ the probability of $\lceil w \rceil a$ jumps discontinuously to 1, as a
consequence of the rules for the δ-symbol set up in the table; and for $w > p$
the sentence $\lceil w \rceil a$ has a probability greater than $P(a)$.

By means of this negation we can coordinate a sentence of any degree q of
probability² to a given sentence of the probability p. When we choose, in
particular, $w = p$ for the negation variable, the coordinated sentence has the
probability 1.

² With the exception of the value $q = 0$, which can be assumed only for $p = 1$ and $w = 0$.
For p, however, the 0-value is not excepted; that is, to a sentence of the probability 0 we
can coordinate a sentence of any other value.



Fig. 27. Diagram of quantitative 
negation. 
It is easily seen from the table that this is the only case in which $q = 1$,
that is, the statement $\lceil w \rceil a$ has the probability 1 if and only if $w = P(a)$.
We see, furthermore, that a negation to the degree $w = 1$ leaves the truth
value of the statement unchanged. A negation to the degree $w = 0$, in
general, also leaves the truth value unchanged, except for the case $p = 0$ or
$p = 1$, where the negation reverses the truth value.
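
The properties of table 13 stated above can be checked numerically. The following is a minimal Python sketch and not part of the original text; the function names `delta` and `quantitative_negation` are chosen here for illustration, and exact fractions are used so that the boundary cases $p - w = 0$ and $p - w = 1$ are not blurred by rounding.

```python
from fractions import Fraction


def delta(d: Fraction) -> int:
    """The delta-symbol of table 13, as a function of d = p - w."""
    if d <= 0:
        return 1
    if d < 1:
        return 0
    return -1  # d == 1, which occurs only for p = 1 and w = 0


def quantitative_negation(w: Fraction, p: Fraction) -> Fraction:
    """Probability of [w]a when a has the probability p (0 <= w, p <= 1)."""
    if not (0 <= w <= 1 and 0 <= p <= 1):
        raise ValueError("w and p must lie in the closed interval [0, 1]")
    d = p - w
    return d + delta(d)


p = Fraction(3, 10)
print(quantitative_negation(Fraction(1, 10), p))  # w < p: smaller than p
print(quantitative_negation(p, p))                # w = p: jumps to 1
print(quantitative_negation(Fraction(1, 2), p))   # w > p: greater than p
```

On a grid of values the value 1 is reached only for $w = p$, and the degrees $w = 1$ and $w = 0$ behave as described: the first leaves every value unchanged, the second reverses only the values 0 and 1.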

We now apply the rule of assertion (§ 82), according to which propositions
of the probability 1, and only those, may be asserted. The quantitative
negation, consequently, allows us to state the degree of probability in the
object language. In order to express the metalinguistic sentence $P(a) = p$,
we write in the object language the true sentence

$\lceil p \rceil a$    (4)

The symbol $\lceil p \rceil$ before the sentence has the same function as the number of
negation signs in the expressions (1) and (2). Because of the continuous
character of the truth scale, however, we deny a sentence to a certain degree.
We state a true sentence by denying the sentence a to the degree p.

The sign $\lceil w \rceil$ can be put before compound sentences also. Thus the
meaning of $\lceil w \rceil (a \lor b)$ is determined by the truth table 13 when we put there
$a \lor b$ for a. Similarly, the meaning of $\lceil w \rceil (a,b)$ is determined by table 13.

Although the quantitative negation was introduced in the logic of weight,
a frequency interpretation of this operation can be given. In a propositional
sequence containing the sentences $b_1$, $b_2$, $b_3$, and so on in repeated occurrence,
the negation determines for every sentence whether it satisfies the sentence
$\bar{b}_1$, namely, it determines all sentences $b_2$, $b_3$, and so on, as being of this
kind. In contradistinction to this procedure, the quantitative negation does
not determine individual sentences as being of the form $\lceil w \rceil b_1$. We are free,
rather, to regard any sentence of the form $b_1$, $b_2$, $b_3$, and so on, as being of
the form $\lceil w \rceil b_1$, the only condition being that the total number of sentences
$\lceil w \rceil b_1$ conform to the degree of probability supplied by table 13 (p. 421).
If $P(\lceil w \rceil b_1) = 1$, all sentences of the sequence will have the form
$\lceil w \rceil b_1$; or, for infinite sequences, at least so many sentences that the limit
of the frequency is = 1.

Because of the arbitrary distribution of the sentences satisfying $\lceil w \rceil b_1$ in
the sequence, the value $P(a, \lceil w \rceil b_1)$ is not determined in terms of
$P(\lceil w \rceil b_1)$, but must be given independently. Similarly, the value
$P(\lceil w \rceil b_1, c)$ is independent of $P(\lceil w \rceil b_1)$. For the frequency
interpretation of the expression $P(\lceil w \rceil (a,b))$ we use only the subsequence
selected by a, that is, we regard only sentences chosen from this subsequence as
having the form $\lceil w \rceil (a,b)$. This means that we count the frequency of
$P(\lceil w \rceil (a,b))$ in the subsequence, like that of $P(a,b)$. The reason for this
interpretation of $P(\lceil w \rceil (a,b))$ is that this probability is a function of
$P(a,b)$, according to table 13.
The interpretation makes clear that the quantitative negation constitutes
a negation relative to a certain sequence as reference sequence. If a
probability expression contains no comma, the reference sequence is given by the
main sequence tacitly assumed for all probability expressions; if it contains
a comma, the sequence given by the first term of the probability expression
is the reference sequence. The application of the quantitative negation to
individual sentences in the logic of weights, therefore, has the same fictitious
character as the application of degrees of probability to individual sentences.

Using the quantitative negation for the indication of the degree of
probability, we can now transfer all derivative methods of the metalanguage into
the object language. In order to carry through this program, we must first
introduce tautologies of a second kind, different from those that are identical
with two-valued tautologies and that were discussed in § 82 under the name
of tautologies of the first kind.

Tautologies of the second kind contain a quantitative negation and
constitute the analogues of P-formulas of the second category derivable in the
metalanguage of probability logic, like (2, § 82), that is, of formulas stating
the equality of certain degrees of probability. For instance, the formula

$P(a) + P(\bar{a}) = 1$    (5)

can be written in the form

$[P(a) = p] \supset [P(\bar{a}) = 1 - p]$    (6)

The implication is the two-valued implication of the metalanguage. Using
the quantitative negation, we can construct the object-language equivalent
of (6) in the form

$\lceil p \rceil a \rightarrow \lceil 1 - p \rceil \bar{a}$    (7)

We must discuss the implication indicated by the arrow. It is not
permissible to identify it with the implication defined in the sixth column of
the truth table 11B (p. 408) of probability logic. For obvious reasons we
can also establish in (7) an implication going from right to left; the two
implications, then, would constitute an equivalence in the sense of the seventh
column of table 11B. But it is easily seen that the expressions on the two
sides of (7), in general, do not have the same probability value. This follows
when we apply table 13 of the quantitative negation, which shows that
equal probability values result only for $P(a) = p$, or, what is the same, for
$P(\lceil p \rceil a) = 1$. Now we want to regard (7) as a tautology, as a formula having
the probability 1 for all probability values $P(a)$. In other words, we want
(7) to hold also for values other than $P(\lceil p \rceil a) = 1$. It follows, then, that the
implication in (7) requires a definition different from that given in table 11B.

For these reasons we introduce the truth table 14, which defines the
alternative implication, expressed by the arrow. This implication corresponds to
the two-valued implication in that it is true whenever the implicans is not
completely true (third row of the table). It is obvious that, because of this
property of the implication, formula (7) will have the probability 1 for all
values of $P(\lceil p \rceil a)$, that is, it will be a tautology. What (7) asserts, in fact,
is only that the probability of the implicate must be = 1 when the
probability of the implicans is = 1, whereas any assertion concerning other
probability values is omitted. For this reason the existence of the inverse
implication in (7), going from right to left, does not lead to an equivalence; the
double-arrow implication states the equality of truth values only for the
value 1, leaving open any statement about other values.


TABLE 14

Truth Table of Alternative Implication in Probability Logic

    $P(a)$     $P(b)$      $P(a \rightarrow b)$
    $1$        $1$         $1$
    $1$        $< 1$       $0$
    $< 1$      $\le 1$     $1$

The name alternative implication has been chosen in order to indicate that
the implication is capable only of the two truth values 1 and 0, as shown by
the third column of table 14. To use an alternative implication in tautologies
of the second kind seems appropriate because such tautologies correspond to
formulas of the two-valued metalanguage, like (6). We thus arrive at formulas
of the object language that can only be true or false. The formula

$\lceil p \rceil a \rightarrow \lceil p \rceil \bar{a}$    (8)

for instance, is false for all values $p \neq \tfrac{1}{2}$ and $P(a) = p$, and it would not
make sense to regard this formula for such values as probable to a certain
degree. That it is possible to construct two-valued formulas within a
multivalued logic is known from the study of three-valued logic;³ the alternative
implication of table 14, in fact, corresponds to an operation of the same
name employed in three-valued logic.⁴
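
The behavior of formulas (7) and (8) under tables 13 and 14 can be verified on a grid of probability values. The sketch below is an illustration added under stated assumptions, not the author's own procedure: `qneg` and `alt_implies` are hypothetical names, the probability of the negated sentence is taken as 1 − P(a) according to (5), and exact fractions are used throughout.

```python
from fractions import Fraction


def qneg(w, p):
    """Table 13: probability of [w]a when P(a) = p."""
    d = p - w
    return d + (1 if d <= 0 else 0 if d < 1 else -1)


def alt_implies(pa, pb):
    """Table 14: value 0 only if the implicans has probability 1 and the implicate has not."""
    return 0 if (pa == 1 and pb < 1) else 1


grid = [Fraction(k, 10) for k in range(11)]

# Formula (7): [p]a -> [1-p](not-a), for every degree p and every value x = P(a).
formula_7 = [alt_implies(qneg(p, x), qneg(1 - p, 1 - x)) for p in grid for x in grid]

# Formula (8): [p]a -> [p](not-a), evaluated for the case P(a) = p.
formula_8 = {p: alt_implies(qneg(p, p), qneg(p, 1 - p)) for p in grid}

print(set(formula_7))                      # only the value 1 occurs
print([str(p) for p, v in formula_8.items() if v == 1])  # only p = 1/2
```

Every entry of `formula_7` equals 1, so that (7) behaves as a tautology on the grid, whereas (8) takes the value 1 only for p = 1/2 and the value 0 otherwise; neither formula assumes an intermediate value.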

By means of the alternative implication we can transcribe all formulas of
the second category derivable in the metalanguage of probability logic, like
(2, § 82), into formulas of the object language. The transcribed formulas
contain a quantitative negation and represent tautologies of the second kind.
Thus, formula (2, § 82) supplies the tautology:

$\lceil p \rceil a \,.\, \lceil u \rceil (a,b) \,.\, \lceil v \rceil (\bar{a},b) \rightarrow \lceil pu + (1 - p)v \rceil b$    (9)

³ See H. Reichenbach, Philosophic Foundations of Quantum Mechanics (Berkeley, 1944),
pp. 154, 159.
⁴ Ibid., p. 151.
For this formula, of course, the arrow cannot be reversed, as in (7). There
exist, however, inverse formulas, that is, formulas resulting from (9) by exchanging
the right side with one of the units of the left side. For the derivation of such
formulas the rule of existence is to be assumed, as was explained with respect
to (2, § 13). We shall not go into the formalization of this procedure, since
we did not do so for the calculus of probability. Note that the comma
expressions in (9) are meaningful only if the probability of the reference term does
not vanish.

The existence of tautologies of the second kind shows that probability
logic is richer in tautologies than two-valued logic. Every tautology of
two-valued logic is isomorphous to a tautology in probability logic, but not vice
versa.

After the construction of tautologies of the second kind, we can turn to
the consideration of the direct method of derivation. The rules of derivation,
to be used for derivations in the object language, are the same as those of
two-valued logic. The rule of substitution obviously holds; it was demonstrated
above that the rule of replacement is applicable. The rule of inference (modus
ponens) holds in the two forms

$$\frac{\begin{array}{l} a \\ a \rightarrow b \end{array}}{b} \qquad (10)$$

$$\frac{\begin{array}{l} a \\ a \supset b \end{array}}{b} \qquad (11)$$

The justification of (10) follows from the definition of the alternative
implication in table 14 (p. 424); if $P(a) = 1$ and $P(a \rightarrow b) = 1$, we must have
$P(b) = 1$. The implication in (11) is the implication defined in column 6 of
table 11B (p. 408) of probability logic. It is easily derivable from the value
$1 - p + pu$ of this column that $u = 1$ when the premises of (11) have the
probability 1; this leads to $P(b) = 1$.
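
The short derivation behind (11) may be written out as follows. It is a sketch under the assumption, suggested by the notation of (13), that p stands for $P(a)$ and u for $P(a,b)$; the last line uses the general theorem of multiplication and the fact that b is at least as probable as the conjunction $a . b$.

```latex
\begin{align*}
P(a \supset b) &= 1 - p + pu, \qquad p = P(a), \quad u = P(a,b) \\
p = 1 \ \text{and}\ P(a \supset b) = 1 &\;\Longrightarrow\; 1 - 1 + u = 1 \;\Longrightarrow\; u = 1 \\
P(b) \;\ge\; P(a \,.\, b) &= P(a)\,P(a,b) = pu = 1 \;\Longrightarrow\; P(b) = 1
\end{align*}
```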

Besides the fundamental rules (10) and (11), every tautological implication
can be employed for the construction of a secondary rule of inference. Thus
(9) leads to the inferential schema

$$\frac{\begin{array}{l} \lceil p \rceil a \\ \lceil u \rceil (a,b) \\ \lceil v \rceil (\bar{a},b) \end{array}}{\lceil pu + (1 - p)v \rceil b} \qquad (12)$$

This schema is the object-language analogue of the inferential schema of the
metalanguage:

$$\frac{\begin{array}{l} P(a) = p \\ P(a,b) = u \\ P(\bar{a},b) = v \end{array}}{P(b) = pu + (1 - p)v} \qquad (13)$$

The inferential schema (12) can be regarded as a generalization of the
modus ponens for premises that are not true but have only certain degrees
p, u, v of probability. A special form of (12) results for $p = 1$; we then have

$$\frac{\begin{array}{l} a \\ \lceil u \rceil (a,b) \end{array}}{\lceil u \rceil b} \qquad (14)$$

This schema resembles more closely the modus ponens.
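
In the frequency interpretation the conclusion of schema (13) can be compared with a direct count. The following Python sketch is an illustration only: the function name `weight_of_b` and the short sequence of truth-value pairs are hypothetical, chosen so that all three premises are defined (neither a nor its negation is empty in the sequence).

```python
from fractions import Fraction


def weight_of_b(p, u, v):
    """Conclusion of schema (13): P(b) = p*u + (1 - p)*v."""
    return p * u + (1 - p) * v


# Hypothetical finite sequence of (a, b) truth-value pairs, used only to
# compare the computed value with a counted frequency.
sequence = [(True, True), (True, False), (True, True), (False, True),
            (False, False), (True, True), (False, False), (True, False)]

n = len(sequence)
n_a = sum(1 for a, _ in sequence if a)

p = Fraction(n_a, n)                                               # P(a)
u = Fraction(sum(1 for a, b in sequence if a and b), n_a)          # P(a,b)
v = Fraction(sum(1 for a, b in sequence if not a and b), n - n_a)  # P(not-a,b)
freq_b = Fraction(sum(1 for _, b in sequence if b), n)             # P(b), counted directly

print(weight_of_b(p, u, v), freq_b)
```

For p = 1 the second term drops out and `weight_of_b(1, u, v)` returns u for any v, which corresponds to the special form (14).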

I do not wish to suggest the notation in terms of the quantitative negation
for practical use. The P-notation in the metalanguage is technically superior
for derivations. But it is important that probability logic can be made
complete, so that it includes direct methods of derivation.
§ 84. The Various Forms of Induction in Empirical Science

The analysis of probability statements presented in chapter 9 has led to the
result that the meaning of the word “probable” is always reducible to a
frequency meaning. This result includes the logical interpretation of probability,
although by the concept of weight this interpretation acquires a certain
independence and assumes the form of a logic in which the reference to frequencies
is not explicitly stated. The problem of the assertability of probability
statements finds a simple solution so far as the laws of probability are concerned,
since the laws are made tautological by the frequency interpretation. There
remains only one problem, which is inherent in the frequency interpretation:
the problem of the ascertainment of the degree of probability, which has
been shown to be the same as the problem of induction.

The word “induction” is usually employed in a sense more comprehensive 
than the one so far envisaged in this work. The rule by which we infer that 
the frequency observed in an initial section will persist for the whole sequence 
is regarded as a special case of induction, often called induction by enumeration. 
Postponing the discussion of induction by enumeration, we shall consider 
first the other forms of induction employed in scientific method. 

Francis Bacon, who was the first to emphasize the need for inductive 
inference in scientific method, regarded induction by enumeration as a poor 
instrument of prediction and attempted to devise inductive methods superior 
to it. His tables of presence, of absence, and of degrees were constructed 
for that purpose. More than two centuries later, they were adopted by John 
Stuart Mill, who reformulated them as canons of induction and believed
that in them he had constructed the ultimate form of scientific inference. 

When we consider these improved forms of induction critically, we find that
they contain three additions to induction by enumeration. The first is a
trivial use of deduction: Bacon’s table of absence, or Mill’s canon of difference,
calls for a collection of instances in which a factor A is not connected with a
factor B, that is, instances which prove that the conclusion “All A are B”
is not true. Here deduction is employed in a trivial form to rule out
impermissible inductive conclusions. The method, of course, is applicable only
when classical induction (§ 67) is concerned; it cannot be applied to statistical
induction. The second addition is the emphasis on large numbers, illustrated
particularly by Bacon’s long tables on the phenomena of heat. The third
addition is clearly stated by Bacon, but not by Mill, though it obviously
underlies his theories, too: the rule that the instances collected must constitute
a fair sample. Bacon¹ refers to it in his story of the shipwreck:

It was a good answer that was made by one who when they showed him hanging in a 
temple a picture of those who had paid their vows as having escaped shipwreck, and would 
have him say whether he did not now acknowledge the power of the gods—Aye, asked he 
again, but where are they painted that were drowned after their vows? 

The disregard of the rule of the fair sample may be called the fallacy of biased
statistics. Mill assumes the rule, in particular, in his canon of concomitant
variations, for such variations can establish an inductive relationship only
when they are taken at random. In Charles Peirce’s investigations of
induction, the rule of the fair sample plays an important part.

Another method of improving induction, though it is not mentioned by
Bacon, Mill, or Peirce, may be called cross induction. For example, the
inference that, because all the swans so far observed have been white, all the
swans in the world are white, has turned out to be a bad inference, since
black swans were found in Australia. Even before the discovery of black swans,
however, the conclusion could have been questioned by the following
consideration: it is a general rule for biological species that color is not a constant
characteristic within a species; therefore, one should not have inferred that
all swans are white.² Here an inference of induction by enumeration, applied
to one sequence, is criticized by means of another inductive inference that
refers to sequences as elements.

The inference may be illustrated by the schema 


    B  B  B  B  B  B  B  . . .
    C  C  C  C  C  C  C  . . .
    S  S  S  S  S  S
For each of the first horizontal sequences an inductive inference leads to the
result that both positive and negative instances will always occur; for the
last horizontal sequence the inductive conclusion is that only positive instances
will occur. Making an inductive inference in the vertical direction, for which
each horizontal sequence is an element, we infer that, since all the other
sequences show both kinds of instances, the last horizontal sequence will do
the same if it is sufficiently continued. We thus cancel an individual inductive
inference by means of a cross induction.
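
The interplay of the horizontal and the vertical inference may be made concrete by a small Python sketch. Everything in it is an assumption introduced for illustration: the observation rows are invented data, `posit` is a hypothetical name for the rule of induction by enumeration, and the threshold of one half for letting the vertical inference override the horizontal one is an arbitrary choice, not a rule stated in the text.

```python
from fractions import Fraction


def posit(observations):
    """Induction by enumeration: posit that the observed relative frequency persists."""
    return Fraction(sum(observations), len(observations))


# Hypothetical data: True marks a positive instance, False a negative one.
rows = {
    "B": [True, False, True, True, False, True, False],
    "C": [True, True, False, True, False, False, True],
    "S": [True, True, True, True, True, True],
}

# Horizontal inferences, one for each sequence taken by itself.
horizontal = {name: posit(obs) for name, obs in rows.items()}

# Vertical inference: every other sequence counts as one element, and the
# attribute counted is "the sequence shows both kinds of instances".
others = [obs for name, obs in rows.items() if name != "S"]
mixed = [any(obs) and not all(obs) for obs in others]
cross = posit(mixed)

# Assumed decision rule: the cross induction cancels the horizontal posit
# for S when mixed sequences are in the majority among the other rows.
s_expected_mixed = cross > Fraction(1, 2)

print(horizontal["S"], cross, s_expected_mixed)
```

Taken alone, the last row yields the posit 1 for positive instances; the vertical count over the other rows yields the posit 1 for mixed sequences, and it is this second posit that is allowed to cancel the first.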

The schema is frequently applied. It is given in the inference that, although
carbon has not yet been melted, it will melt at higher temperatures because
all other substances do so.³ The inference that all men are mortal, too, has
this form when we regard the lives of individuals as sequences of days; that
the sequences of days of persons now living will terminate in a day of death
is inferred by a cross induction.

¹ Novum organum scientiarum (published 1620), aphorism 46.

² André Lalande, . . . Les Théories de l'induction et de l'expérimentation (Paris, 1929).

Whereas all these forms of inference clearly show their relation to induction
by enumeration, there is another form of induction that, at first sight, seems
to be of a different nature. It is based on the application of causal explanation
and may therefore be called explanatory induction. It consists in the inference
from certain observational data to a hypothesis, or theory, from which the
data are derivable and which, conversely, is regarded as being made probable
by the data. Such inferences are used in the establishment of scientific theories.
They are applied also in the investigations of a detective who, in the search
for the perpetrator of a crime, constructs an explanation on the basis of
observational findings. In the physical sciences the explanation is carried
through by means of mathematical methods, a fact showing that the inference
from the hypothesis to the data may be of the deductive type. The
advancement of science in the last centuries, in fact, is due to the application of
explanatory induction; and the critics of John Stuart Mill were right when
they insisted that no theory of induction is satisfactory unless it includes
an account of explanatory induction.

The various forms of induction presented include a common feature: they
all constitute inductive inferences made in an advanced state of knowledge
(see § 70), that is, inferences based, not on new observational data alone,
but also on the results of previous inductive inferences. In fact, it virtually
never happens that an inductive inference is made in isolation; the success
of induction is based on a method of concatenation, which combines many
inductive inferences in a network of inference. This fact has usually been
overlooked in the theories of induction. If a great deal of knowledge is taken
for granted, the inductive inference assumes particular forms that are
justifiable only on the basis of tacit assumptions. Presuppositions of the specific
form, however, were overlooked, and theories of induction were constructed
that took one or the other specific form, without mention of the conditions
of its applicability, as representative of the inductive inference in general.
The resulting theories illustrate the fallacy of incomplete schematization
(§§ 21, 08).

So far as explanatory induction is concerned, a fallacious interpretation
was discussed at the end of § 21, in the analysis of the so-called inference by
confirmation. This inference has also been called the hypothetico-deductive
method, the term being meant to indicate the deductive relation from the
hypothesis to the observational data. The fallacy consists in the belief that
an inductive relation holds for the reversed direction, or, more precisely
speaking, that the implication $a \supset b$ entitles us to regard a as probable when
b is given.

³ See EP, p. 365.

Explanatory induction must not be interpreted in the same sense. Only for
a superficial consideration does explanation have the form of the
hypothetico-deductive method, or of the inference by confirmation. In deeper analysis
it reveals a much more complicated structure. Explanatory induction must
be regarded, not as an inference in its own right, but as a combination of
probability inferences such as are formulated by the rule of Bayes. The
complicated nature of the inference is made clear by the fact that the inference
is applied only when much more is known than the occurrence of the
consequences of an assumption; without an estimate of the antecedent probabilities
the inference is never made.
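
The dependence of the inference on the antecedent probabilities can be shown by a small computation with the rule of Bayes. The Python sketch below is an illustration with invented numbers: two rival hypotheses are assumed, the first of which implies the observed datum (likelihood 1) while the second makes it probable to the degree 1/2; the function name `posterior` is hypothetical.

```python
from fractions import Fraction


def posterior(prior, likelihoods):
    """Rule of Bayes: inverse probabilities of the hypotheses, given the datum."""
    joint = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]


# Probability of the observed datum on each of the two hypotheses.
likelihoods = [Fraction(1), Fraction(1, 2)]

# The same deductive relation, combined with two different sets of
# antecedent probabilities (assumed values).
high_prior = posterior([Fraction(1, 2), Fraction(1, 2)], likelihoods)
low_prior = posterior([Fraction(1, 100), Fraction(99, 100)], likelihoods)

print(high_prior[0], low_prior[0])
```

Although the datum is derivable from the first hypothesis in both cases, its probability after the observation is 2/3 for the even antecedent probabilities and only 2/101 for the unfavorable ones; the implication from hypothesis to datum alone therefore determines nothing.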

For scientific theories, estimates of the antecedent probabilities are often
given in the form of considerations about the plausibility or “naturalness”
of a theory, that is, by arguments that make the theory credible, independent
of the observed confirmation. For instance, Newton's law of gravitation has
a high antecedent probability because of the term $r^2$ in the denominator,
which corresponds to the decrease of a force spread over the surface of a
sphere. The law thus appears as an expression of the three-dimensionality
of space, since only in a three-dimensional space does the surface of a sphere
increase with the square of the radius.

For detective cases the antecedent probabilities appear in the form of a
discussion of the motives of the crime. Simple examples, in which the
computation of the probability of the explanation can actually be carried through,
were given in the exercises appended to chapter 3. Another illustration is
found in the analysis of a problem of circumstantial evidence in § 47, in
which the probability of an insurance murder is computed numerically. The
misunderstandings of the inference by indirect evidence and its interpretation
as an inference by confirmation may perhaps be psychologically explained
as an oversight of the role which the antecedent probabilities play in the
inference. These probabilities are easily overlooked because they often need
not be known otherwise than in the form of crude estimates, while the result
of the inference can be very precise (see §§ 02, 70).

The thesis may be generalized: all inductive inferences that do not have
the form of induction by enumeration must be construed in terms of the
theorems of the calculus of probability. In fact, the calculus of probability
contains the key to a theory of induction in advanced knowledge. Philosophers
who believe that a philosophical theory of induction is to be developed
independent of the statistical methods employed in the sciences make the mistake
of overlooking the existing mathematical methodology: all the questions
concerning induction in advanced knowledge, or advanced induction, are
answered in the calculus of probability. While logicians were vainly looking
for an inductive logic that could account for scientific method, mathematicians
constructed a mathematical system that covers all forms of probability
inference and thus of scientific inference, a system that can even be transcribed
into a system of logic, as was shown in chapter 10. The logician of our day
who is aware of the fallacies of the philosophy of rationalism abandons all
attempts at a construction of an inductive logic from pure reason. The
inductive method presented by the calculus of probability is a much more powerful
instrument than any substitute devised under the name of rational belief;
moreover, it admits of an empiricist interpretation that rejects all forms of
synthetic self-evidence.

The probability character of explanatory induction is also indicated by the
fact that the merging of scientific theories of different domains leads to an
increase in reliability. Thus Newton's combination of Galileo's theory of
falling bodies and Kepler's laws of planetary motion led to a theory of
gravitation that was superior in reliability to either of the theories included in it.
Some logicians have regarded the unification of theories as the expression
of a tendency to logical elegance, or economy. Such an interpretation,
however, seriously misrepresents the nature of scientific method. The unification
of theories is an instrument for connecting scientific results in such a way
that the combination obtains a higher probability than each of its parts taken
separately. The schema of these inferences can be understood when it is
interpreted in terms of the theorems of the calculus of probability. Such an analysis
makes clear that the theory of advanced induction is identical with the theory
of probability.

Since the axiomatic construction of the calculus of probability leads to
the result that, when the frequency theory is assumed, all probability
inferences are reducible to deductive inferences with the addition of induction
by enumeration, it follows that all inductive inferences are reducible to
induction by enumeration. This thesis was at the basis of Hume’s theory
of induction (though he thought only of classical induction), but he had no
proof for it. The proof can be given only by the axiomatic construction of
the calculus of probability.

The thesis, furthermore, must not be oversimplified to the statement that
all inductive inferences can be construed directly as induction by
enumeration; the reduction is possible only indirectly through the reducibility of the
axioms of probability to the frequency interpretation. The thesis may be
compared to a result of axiomatic constructions of mathematics, according to
which all mathematical operations are reducible to the operation of addition
by one, a reduction that, too, can be claimed only in principle, but cannot
actually be carried through.

The thesis is further obscured by a confusion between the context of
discovery and the context of justification, if I may be allowed to use certain
terms that I have introduced elsewhere.⁴ The finding of explanation belongs
in the context of discovery and can be analyzed only psychologically, not
logically; it is a process of intuitive guessing and cannot be portrayed by a
rational procedure controlled by logical rules. Rationalization belongs in the
context of justification; it can be applied only when given inductive
conclusions are to be judged appropriate to given facts. It is in this context that the
thesis belongs. Testing the relations between given observational data and
given inductive conclusions is a procedure expressible in terms of theorems
of the calculus of probability; and the inductive inferences of the test
procedure are, therefore, ultimately reducible to induction by enumeration.

I do not maintain, by this thesis, that the finding of inductive explanation 
could be achieved by enumerating observations and simple generalization,
such as Bacon hoped to attain in his tables; nor do I claim to have better
methods. I refuse to answer the challenge of setting up rules of a logic of
discovery. There are no such rules. Philosophers who believe that induction
could become a sort of philosopher's stone, supplying methods that auto¬ 
matically transform facts into theories, misunderstand the task of logical 
analysis and burden the theory of induction with an unsolvable problem. 
Like deductive logic, the logic of induction concerns, not the psychological 
process of finding solutions, but the critical process of testing given solutions; 
it applies to the rational reconstruction of knowledge and thus belongs in the 
context of justification, not in the context of discovery. 


§ 85. The Probability of Hypotheses 


The thesis that explanatory induction can be construed in terms of the 
theorems of the calculus of probability and is therefore reducible to induction 
by enumeration is attacked by the argument that the probability of hypotheses
is not interpretable as a frequency. Although the general discussion of non- 
frequency probabilities (§71) covers this case also, and shows that the prob¬ 
ability of a hypothesis, like that of any other single case, must be interpreted 
as a relative frequency, I should like to add some remarks on how the inter¬ 
pretation can be carried through—how, in particular, the reference class of a 
hypothesis is to be determined. 

Scientific hypotheses are all-statements: they assert that for all things of 
a certain kind, at all times and places, a certain relation holds. So we begin 
this inquiry by studying the probability of all-statements. 

4 See EP, pp. 6-7.

When a scientific law is stated in the form of a general implication,
symbolized as

$$(x)[f(x) \supset g(x)] \tag{1}$$

the formulation must be regarded as a schematization, introduced because
the logical treatment of all-statements in a two-valued logic is much simpler
than the use of probability implications within the framework of probability 
logic. What can be proved by inductive methods is only that a probability 
implication of a high degree exists; the transition to the general implication 
is bound to certain conditions the neglect of which leads to paradoxes. 

Such paradoxes appeared in the theory of confirmation when it was applied
to the establishment of implications like (1), as was shown by C. G. Hempel. 1
The theory employs, as confirming cases, observations of the kind $f(x).g(x)$;
the larger the number of such cases, it is argued, the better is the general
implication confirmed. But since, according to the rule of contraposition
(6c, § 4), (1) is tautologically equivalent to the form

$$(x)[\overline{g(x)} \supset \overline{f(x)}] \tag{2}$$

we must regard as confirming cases observations of the form
$\overline{g(x)}.\overline{f(x)}$ also.
This consequence seems absurd. Returning to the example of the swans, we 
would have to regard as confirming cases of the statement, “All swans are 
white”, not only observations of white swans, but also of anything that is 
not white and not a swan, for instance, of red flowers. 

The paradox seems to be unsolvable within the theory of confirmation. 
It disappears, however, as soon as probabilities are introduced. For a
probability implication, even of the degree 1, contraposition does not hold, and
thus the two forms corresponding to (1) and (2)

$$(x_i)[f(x_i) \underset{1}{\ni} g(x_i)] \tag{3}$$

$$(x_i)[\overline{g(x_i)} \underset{1}{\ni} \overline{f(x_i)}] \tag{4}$$

are not equivalent. This fact can also be made clear as follows. If (3) were
manifestly false and we had only a probability implication of a low degree, 
(4) might remain virtually true. The reason is that only a small number of 
things that are not white will be swans. Consequently, if the truth of (4) is 
established to a high degree of probability, (3) need not be true, and we 
cannot use an establishment of (4) as a proof for the validity of (3). From 
the standpoint of probability theory, a general implication represents a 
degenerate case. It can be substituted for a probability implication of a high 
degree only after both the relations (3) and (4) have been established
independently. Conversely, the use of all-statements must be interpreted as
indicating, not that the degree of probability is assumed as strictly = 1 (all that
empirical evidence can prove is a probability within a small interval $1 - \delta$),
but that both the relations (3) and (4) have been verified practically within
an interval $\delta$ of exactness. The analysis shows that the method of inductive
verification must be attached to probability statements and not to the
schematized form of knowledge in which such statements are replaced by two-valued
statements.

1 A purely syntactical definition of confirmation, in Jour. of Symbolic Logic, Vol. VIII
(1943), p. 128.
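The failure of contraposition for probability implications can be checked on a small frequency table. The following is only an illustrative sketch: the counts are invented for the purpose, and the two function names are hypothetical. It computes the degree of form (3), the relative frequency of white things among swans, and the degree of form (4), the relative frequency of non-swans among non-white things.

```python
from fractions import Fraction

# Hypothetical counts of observed things, chosen only for illustration.
counts = {
    ("swan", "white"): 10,
    ("swan", "nonwhite"): 90,
    ("other", "white"): 1000,
    ("other", "nonwhite"): 100000,
}


def degree_swan_implies_white(c):
    """Relative frequency of white things among swans: the degree of form (3)."""
    swans = c[("swan", "white")] + c[("swan", "nonwhite")]
    return Fraction(c[("swan", "white")], swans)


def degree_nonwhite_implies_nonswan(c):
    """Relative frequency of non-swans among non-white things: the degree of form (4)."""
    nonwhite = c[("swan", "nonwhite")] + c[("other", "nonwhite")]
    return Fraction(c[("other", "nonwhite")], nonwhite)


p3 = degree_swan_implies_white(counts)
p4 = degree_nonwhite_implies_nonswan(counts)
print("degree of (3):", float(p3))
print("degree of (4):", float(p4))
```

With these assumed counts the degree of (4) is close to 1 although the degree of (3) is low, simply because only a small fraction of the non-white things are swans.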

Apart from contraposition, a further difference between general implications
and probability implications of the degree 1, or of a high degree of
probability, is given in the fact that from $(A \supset B)$ we can derive $(A.C \supset B)$
for every $C$; whereas if $(A \underset{p}{\ni} B)$ holds, with $p$ almost equal to 1, there
always exist classes $C$ such that $(A.C \underset{q}{\ni} B)$, where $q$ is a low probability. 2 When
we use a general implication as a schematization for a probability implication
of a high degree of probability, we usually require, therefore, that at least
no class $C$ be known such that $P(A.C,B)$ is small. If such a class $C$ is known,
we can derive that $P(A.\bar{C},B) > P(A,B)$ [see the remarks following (116,
§ 19)], and we then reformulate the all-statement by the use of $A.\bar{C}$ as
reference class, that is, by excluding the known exceptions from the implicans.
In other words, we require that no rule be known by means of which
exceptions to the all-statement could be predicted. This usage of language makes
it evident that the use of all-statements in place of probability implications
of high degrees depends on more conditions than are given by the existence
of a high degree of probability.
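A small numerical illustration may make the role of the class C concrete. The counts below are assumptions made up for this sketch, as are the variable names: within the reference class A, a small subclass C collects most of the exceptions, so that the frequency of B is high in A, low in A.C, and higher still once C is excluded.

```python
from fractions import Fraction

# Hypothetical counts within the reference class A, split by membership in C.
# Each pair is (number of elements, number of those elements that are also B).
in_C = (10, 1)
not_in_C = (990, 989)


def freq(pair):
    """Relative frequency of B within the given subclass."""
    total, with_b = pair
    return Fraction(with_b, total)


p_A_B = Fraction(in_C[1] + not_in_C[1], in_C[0] + not_in_C[0])  # P(A,B)
p_AC_B = freq(in_C)          # P(A.C,B), the class of known exceptions
p_AnotC_B = freq(not_in_C)   # P(A.not-C,B), exceptions excluded

print(float(p_A_B), float(p_AC_B), float(p_AnotC_B))
```

In this assumed table the transition from A to the narrower reference class raises the frequency of B, which is the sense in which excluding known exceptions improves the all-statement.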

In this connection, an objection by E. Nagel may be discussed. It concerns 
the question why scientific all-statements are usually conceived as so strictly 
valid that even one exception would be regarded as a sufficient reason to re¬ 
nounce the all-statement. If what is meant by an all-statement is only a high 
probability, occasional exceptions should not be regarded as evidence to the 
contrary. 

This criticism can be answered in various ways. First, if the limits of
exactness are narrowly drawn, there will always be exceptions to scientific
all-statements; that such exceptions are called observational errors does not
change the fact that the all-statement is not strictly satisfied. Second, it is
true that for wide limits of exactness, or merely qualitative statements, a
case of one exception is regarded as incompatible with the all-statement.
For instance, in scientific language we would not say that all human beings
have hearts, if one exception were known. This attitude can be explained in
two ways. First, the degrees of probability for such all-statements are usually
so high that one exception, in fact, must be regarded as a noticeable
diminution of the degree of probability. Second, one exception proves that the strict
all-statement is false, and we dislike using an all-statement as a schematization
if it is known that the all-statement is false. If a statement is used as a
schematization, it should at least be compatible with the existing observational
evidence to assume that the schematization is verbally true.

2 This is true even if $p = 1$, though in this case $q \neq 1$ is possible only if
$P(A,C) = 0$; see (6, § 25). See also the discussion of this peculiarity of
probability implication at the end of § 72.

We shall turn now to the question how to discuss schematized statements
like (1), in the sense of approximations, within the frame of probability
statements. Before the schematization, we have, instead of (1), a statement
of the form 3

$$P[f(x), g(x)] = p \tag{5}$$

We shall consider also the probability of the statement (5) and thus a
statement of the form

$$P\{P[f(x), g(x)] = p \pm \delta\} = q\,dp \tag{6}$$

where $dp = 2\delta$ (for the notation $p \pm \delta$ see footnote, p. 462). The value $q$
is a probability density; for $\delta = 0$, that is, a precise value $p$, the probability
(6) would be = 0.

We begin with the discussion of (5). The statement can be transformed
so that it informs us about the probability of individual implications of the
form $f(x) \supset g(x)$. To demonstrate the procedure we use the tautological
equivalence

$$(A \supset B \equiv A.B \lor \bar{A}) \tag{7}$$

which is easily derivable from the truth table of implication, and have

$$P(A \supset B) = P(A) \cdot P(A,B) + P(\bar{A}) = 1 - P(A) \cdot [1 - P(A,B)] \tag{8}$$

If $P(A) = 1$, we have $P(A \supset B) = P(A,B)$. If $P(A) < 1$, the value of
$P(A \supset B)$ must be closer to 1 than that of $P(A,B)$, since the brackets in (8)
then are multiplied by a factor smaller than 1. Therefore we have the general
inequality

$$P(A \supset B) \geq P(A,B) \tag{9}$$

The equality sign holds only when $P(A) = 1$ or $P(A,B) = 1$. If $P(A,B) = 1$,
we have also $P(A \supset B) = 1$.
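Relations (8) and (9) can be checked mechanically. The sketch below is an illustration only, with hypothetical function names: it evaluates both sides of (8) in exact rational arithmetic on a grid of values for P(A) and P(A,B), and confirms inequality (9) together with the stated condition for equality.

```python
from fractions import Fraction


def p_implication(p_a, p_b_given_a):
    """P(A implies B) by the second form of (8): 1 - P(A) * [1 - P(A,B)]."""
    return 1 - p_a * (1 - p_b_given_a)


def p_implication_from_cases(p_a, p_b_given_a):
    """P(A implies B) by the first form of (8): P(A) * P(A,B) + P(not-A)."""
    return p_a * p_b_given_a + (1 - p_a)


grid = [Fraction(i, 10) for i in range(11)]
for p_a in grid:
    for p_ab in grid:
        value = p_implication(p_a, p_ab)
        # Both forms of (8) agree exactly.
        assert value == p_implication_from_cases(p_a, p_ab)
        # Inequality (9).
        assert value >= p_ab
        # Equality in (9) only when P(A) = 1 or P(A,B) = 1.
        if value == p_ab:
            assert p_a == 1 or p_ab == 1

# Example: if A is rare, the adjunctive implication is nearly certain.
print(p_implication(Fraction(1, 100), Fraction(95, 100)))
```

The last line shows why the degree of the probability implication is only a lower limit: when A itself is improbable, the implication is true in most cases merely because A is false.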

Because of (9) we can always replace a probability implication of the 
degree p by a logical implication, an individual adjunctive implication, with 
the qualification that the probability of the resulting statement is $\geq p$.
Thus, when we find that 95% of all swans are white, we can express the
result in the form that the probability of the statement, “A swan is white”,
is $\geq$ 95%. The degree of the probability implication, therefore, appears as
a lower limit of the probability of the corresponding logical implication. 
This interpretation offers itself when p is close to 1; instead of a probability 
implication, we then apply a logical implication with the qualification that 
the probability of the statement does not quite attain certainty. 

3 We omit the subscript i because in the P-notation the order of the elements x is not 
expressed. 
We now proceed to the discussion of (6). When $p$ is practically = 1 and the
conditions of a transition to an all-statement are satisfied, (6) assumes
the form

$$P\{(x)[f(x) \supset g(x)]\} = q\,dp \tag{10}$$

From the form of (6) it is clear that we must employ a lattice for the
interpretation of this probability, though the lattice superscripts are omitted
in (6); the probability of all-statements, consequently, must be defined as
a second-level probability in a probability lattice. The occurrence of the
factor $dp$ on the right of (10) shows that the probability of a strict
all-statement would be = 0. Only if the all-statement is regarded as admitting of a
small inexactness $dp = \delta$, such that $p \geq 1 - \delta$, will its probability be > 0.

With respect to all-statements, therefore, we can distinguish two kinds of 
probabilities. Probabilities of the first level of the lattice, that is, probabilities 
of the form (5), represent probabilities of individual implications comprised 
by the all-statement, to be used when the all-statement is not strictly verified 
but serves only as an approximation. The probability of the second level 
supplies the probability of the all-statement itself. 4 

Newton’s law of gravitation may be used as an example. It states that
for all bodies, at all times and in all places, the relation

$$f = k \cdot \frac{m_1 m_2}{r^2} \tag{11}$$

holds, where $f$ is the force of attraction, $m_1$ and $m_2$ the respective masses of
the bodies, $r$ their distance, and $k$ a constant. The abbreviation $x$ may denote
a set of individual conditions, including a specification of the bodies involved
and the time and space coordinates. Then, if (11) holds for the individual
conditions to a certain degree $\delta$ of exactness, we write $\varphi(x)$, otherwise
$\overline{\varphi(x)}$. Assume that the relation (11) has been tested for various positions of the
planet Mars; we can write the results in the form of a sequence of terms
$\varphi(x)$ or $\overline{\varphi(x)}$. Doing the same for other planets, the moon, and other tests of
Newton’s law (for example, Cavendish’s experiment with a torsion balance),
we arrive at a lattice

$$\begin{array}{ccccc}
\varphi(x_{11}) & \varphi(x_{12}) & \varphi(x_{13}) & \varphi(x_{14}) & \ldots \\
\varphi(x_{21}) & \varphi(x_{22}) & \varphi(x_{23}) & \varphi(x_{24}) & \ldots \\
\vdots & & & &
\end{array} \tag{12}$$

Each row belongs to one planet or other test object. The negative cases are
usually said to result from errors of observation; for us they are indications
that observation can never strictly establish a general implication, but only
a probability implication of a high degree.

4 This treatment of the probability of hypotheses was first developed by H. Reichenbach
in Erkenntnis, Vol. V (1935), pp. 274-278. The present treatment, however, includes some
additions and clarifications.

We now establish the degree of probability for each row by means of a posit 
based on the inductive rule. Assuming the posits to be true, we count in the 
vertical direction and thus construct the probability of the second level 
holding for the statement that the probability of a row is = p. This prob¬ 
ability, again, is stated in the form of a posit. For the probability of the first 
level the reference class is rather easily constructed. It is the same class 
that is used in implications of the all-statement if such a statement is intro¬ 
duced in the sense of a schematization. For the probability of the second 
level the definition of the reference class is not unambiguous and thus offers 
the usual complications combined with this definition. In principle, however, 
the definition of the reference class follows the same procedure as in all other 
such problems. In this example, the class is constructed by reference to other 
instances where the law applies. The second-level probability, then, expresses 
the probability of an all-statement restricted to one planet. 
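The two directions of counting in schema (12) can be sketched in a few lines. The lattice below is invented for illustration, and the function names are hypothetical; as a simplifying assumption, the observed frequency of each finite row stands in for the posited limit of that row. Counting horizontally gives the first-level frequency of a row; counting vertically gives the fraction of rows whose frequency lies within p ± δ, which serves as the posit for the second-level probability.

```python
from fractions import Fraction


def row_frequency(row):
    """Horizontal count: relative frequency of positive cases in one row."""
    return Fraction(sum(row), len(row))


def second_level_frequency(lattice, p, delta):
    """Vertical count: fraction of rows whose frequency lies within p +/- delta."""
    hits = sum(1 for row in lattice if abs(row_frequency(row) - p) <= delta)
    return Fraction(hits, len(lattice))


# Hypothetical lattice: each row is one test object, True marks a positive case.
lattice = [
    [True] * 19 + [False],
    [True] * 20,
    [True] * 18 + [False] * 2,
    [True] * 12 + [False] * 8,
]

q = second_level_frequency(lattice, Fraction(19, 20), Fraction(1, 20))
print([str(row_frequency(r)) for r in lattice], q)
```

A real application would of course require far longer rows and many more of them; the sketch only displays the structure of the two posits.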

In the astronomical tests of Newton’s law, the observation was made that 
the planet Mercury does not satisfy the law to the same degree of exactness 
as the other planets. This example illustrates an exception to an all-statement 
that was formerly regarded as true without exception. Since the exception 
is restricted to one planet, it is regarded as an example in which one negative 
case is sufficient to disprove a physical law. This interpretation is not entirely 
correct. The measurements of the orbit of Mercury cover a large number of 
individual observations, and in schema (12) the exception would be repre¬ 
sented by one row that does not converge toward the same limit as the others. 
But it is at least true that one negative row is regarded as sufficient to 
disprove the all-statement extended in the vertical direction. The law has 
therefore been restricted to planets that are not too close to the sun, that is, 
to the center of attraction. This is an instance of a transition from a reference 
class $A$ to a reference class $A.\bar{C}$, as explained above (see also the discussion
of the independence of the single case at the end of § 89). That the reference 
class C in which we incorporate the exception is, in this case, assumed to be 
the class of planets near the sun is based on other inductions, including those 
validating Einstein’s theory of general relativity. 

Instead of constructing an individual sequence for each planet, we can 
include in the first horizontal row all observed instances of the law, regardless 
of the individual planet or test object to which they belong. Then the limit 
of the frequency posited for this row determines the probability of the first 
level for the general case, that is, for an individual implication not restricted 
to one planet. In order to define the probability of the second level and thus 
the probability of Newton’s law in general, not restricted to one test object, 
we must construct a reference class by filling out the other rows with
observations pertaining to other physical laws. For instance, for the second row
we can use the law of the conservation of energy; for the third, the law of 
entropy; and so on. The reference class employed corresponds to the way 
in which a scientific theory is actually judged, since confidence in an individual 
law of physics is undoubtedly increased by the fact that other laws, too, have 
proved to be reliable. Conversely, negative experiences with some physical 
laws are regarded as a reason for restricting the validity of other laws that 
so far have not been invalidated. For instance, the fact that Maxwell’s equa¬ 
tions do not apply to Bohr’s atom is regarded as a reason to question the 
applicability of Newton’s or Einstein’s law of gravitation to the quantum 
domain. 

These considerations show that the probability of a hypothesis or a scientific 
theory can be defined in terms of frequencies. Applied to the individual 
hypothesis, the probability assumes the character of a weight; all that was 
said about the use of a weight for statements of single cases holds likewise 
for the weight of hypotheses. In fact, speaking of the probability of an indi¬ 
vidual hypothesis offers no more logical difficulties than speaking of the 
probability of an individual event, say, the death of a certain person. 

It is sometimes argued that in cases of the latter kind the choice of the 
reference class is easily made—that, for example, the reference class “all per¬ 
sons in the same condition of health” offers itself quite naturally. But critics of 
the frequency interpretation of the probability of theories forget how much 
experience and inductive theory is invested in the choice of the reference 
class of the probability of death. Should we some day reach a stage in which 
we have as many statistics on theories as we have today on cases of disease 
and subsequent death, we could select a reference class that satisfies the 
condition of homogeneity (see §86), and the choice of the reference class 
for the probability of theories would seem as natural as that of the reference 
class for the probability of death. In some domains we have actually been 
witnesses of such a development. For instance, we know how to find a reference 
class for the probability of good weather tomorrow; but before the evolution 
of a scientific meteorology this reference class seemed as ambiguous as that 
of a scientific theory may seem today. The selection of a suitable reference 
class is always a problem of advanced knowledge. 

The method described for the statistical definition of the probability of a 
theory, though it is of a schematized form, is not very different from the 
procedure actually used. What is different, however, is that we do not directly 
observe instances of the form $\varphi(x)$, but use other observations from which
we infer that the form $\varphi(x)$ holds. Thus we do not directly observe, for a
certain position of a planet, the force of attraction; we observe, instead,
successive positions from which we infer the acceleration, and identify it
with the force of attraction. The distance between the planet and the sun,
too, is computed by complicated inferences. Apart from this difference,
however, the inductive inference is expressed by schema (12). Thus, when a
certain position and acceleration of the planet confirm the law, we assume
that for this instance the numerical values of the force of attraction, of the
distance, and so on, satisfy the relation $\varphi$; but we do not regard a single
instance as proof that $\varphi$ will always be satisfied. There might exist a different
law $\varphi'$, which, however, is so constructed that, for the special case observed,
the numerical values of $\varphi$ and $\varphi'$ coincide. That such an assumption is false,
and that $\varphi$ always holds, is a result based on the great number of instances
in which $\varphi$ is satisfied.

We can explain why we prefer, for the establishment of the general law, 
a variation of instances from planet to planet rather than a repetition of 
instances for the same planet. By previous inductions we have established 
a general rule that the various positions of one planet, or the various states 
of one body, are controlled by a simple law; therefore, when we have tested 
a law for the various states of one body, we assume it to hold for all states 
of the body, so that further observation virtually does not supply new infor¬ 
mation. Observations of other planets, on the contrary, must be regarded 
as independent observations. 

The logical analysis of this inference is as follows: we have a probability 
of the second level telling us when the horizontal sequence is long enough 
to justify a posit of its persistence, whereas the posit in the vertical direction 
requires independent evidence and is not justifiable by an analogy with the 
posit in the horizontal direction. What is called the weight of an evidence 
is to be interpreted as the result of previous inductions, all of which are ulti¬ 
mately reducible to induction by enumeration. The analysis shows that the 
actual inferences by which the probability of a theory is established include 
the results of a great many previous inductions, and that any reconstruction
of the method will remain a simplified schematization.

A further difference from induction by enumeration results when we ask 
for the probability, not that the law holds, but that it holds when verifying 
observations of a certain kind have been made. This question concerns a 
lattice inference applied to schema (12) and is answered by the probability 
density $v_n(f;p)$ of (5, § 62). The answer is constructed in terms of the rule
of Bayes and presupposes a knowledge of the antecedent probability density
$q(p)$. The latter function can be found, in principle, by counting sequences
vertically in schema (12); a division of the range 0 to 1 for the possible values $p$
of the horizontal probability by small but finite intervals $dp$ leads to an
approximate determination (the precise analysis of this method is given in
§ 89). The probability $v_n(f;p)\,dp$ thus ascertained, in which $f$ measures the
frequency of conforming instances of the law $\varphi$ in the observed initial section,
possesses Bernoulli properties and satisfies Laplace's convergence theorem
(9, § 62). Whatever be the antecedent probabilities, if $n$ is large enough,
$v_n$ is close to 1 for an interval $p \pm \delta$ that includes the value $p = f$.
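The convergence claim can be illustrated numerically. The following is a rough sketch under stated assumptions, not the formulas of § 62 themselves: the sequences are taken to be of the Bernoulli type, the integral over p is replaced by a sum over a grid, and the two antecedent densities (`uniform` and `skewed`) are hypothetical choices made only to show that the result does not depend much on them once n is large. The function applies the rule of Bayes and returns the inverse probability that p lies within f ± δ.

```python
import math


def posterior_mass(n, m, prior, delta, grid=2001):
    """Approximate inverse probability that p lies in f +/- delta, with f = m/n.

    `prior` is an assumed antecedent density on [0, 1]; integration is
    replaced by a sum over an evenly spaced grid of p values.
    """
    f = m / n
    ps = [i / (grid - 1) for i in range(grid)]
    logliks = []
    for p in ps:
        if (p == 0.0 and m > 0) or (p == 1.0 and m < n):
            logliks.append(float("-inf"))  # such a p makes the data impossible
        else:
            ll = 0.0
            if m:
                ll += m * math.log(p)
            if n - m:
                ll += (n - m) * math.log(1.0 - p)
            logliks.append(ll)
    top = max(logliks)  # subtract the maximum to avoid floating-point underflow
    weights = [prior(p) * math.exp(ll - top) for p, ll in zip(ps, logliks)]
    inside = sum(w for p, w in zip(ps, weights) if abs(p - f) <= delta)
    return inside / sum(weights)


def uniform(p):
    """Assumed antecedent density: all values of p equally probable."""
    return 1.0


def skewed(p):
    """Assumed antecedent density strongly favoring low values of p."""
    return (1.0 - p) ** 3


for n in (20, 200, 2000):
    m = round(0.9 * n)  # observed frequency f = 0.9 in each case
    print(n,
          round(posterior_mass(n, m, uniform, 0.05), 3),
          round(posterior_mass(n, m, skewed, 0.05), 3))
```

For small n the two assumed antecedent densities give noticeably different values; for large n both values approach 1, which is the content of the convergence theorem in this simplified setting.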

In the theory of confirmation, which is intended to supply the probability 
$v_n(f;p)\,dp$, the inference is falsely construed as governed by laws that are not
included in the calculus of probability. The analysis presented shows that 
the calculus possesses all the means to account for the inference, and that 
recourse to laws of an independent “inductive logic” is unnecessary. The use 
of a knowledge of antecedent probabilities for the inference is visible in the 
fact that the scientist usually knows when the frequency n of observed 
instances is large enough to warrant the conclusion of its persistence. His 
judgment about the number n may be construed as following computations 
as explained with respect to (12, § 62). 

The considerations presented make it evident that the probability of hy¬ 
potheses offers no difficulties of principle to a statistical interpretation. That, 
in most practical instances, the statistics cannot be carried through numer¬ 
ically because of insufficient data, and that, instead, crude estimates are used, 
do not constitute objections to a theory that claims to embody only the 
rational reconstruction of knowledge, not knowledge in its actual procedure. 

§ 86. Induction by Enumeration in Advanced Knowledge 

The theory of induction by enumeration in advanced knowledge was pre¬ 
sented in § 62. The inductive inference was shown there to refer to a prob¬ 
ability lattice and to be reducible to an application of the rule of Bayes. 
Assuming that the antecedent probabilities are known and that the sequences 
are normal sequences, the following questions can be answered. 

1. Given an initial section of a sequence with the relative frequency $f^n$,
what value $p \pm \delta$ is the best posit for the limit of the frequency?

2. What is the second-level probability $v_n$ that a limit at $p \pm \delta$ will be
reached?

If the antecedent probabilities are not known but the sequences at least
are known to be of the Bernoulli type, we can prove the following theorem:

3. The greater $n$, the greater the second-level probability $v_n$ that the limit
will be at $p = f^n \pm \delta$; and $v_n$ converges to 1 with $n \to \infty$.

We see that the theory of induction by enumeration, in advanced knowledge, 
is as complete as can be required; the theory tells us what value to posit, how 
good our posit is, and that it will become better and better with larger 
numbers. 

In discussions of induction the question has been asked, What is a large 
number? In fact, the conception of a large number varies greatly with the 
field in which the induction is applied. For the test of a new medicine a few 
hundred cases may be sufficient; insurance companies count millions of cases; 
and physicists dealing with problems of the kinetic theory of gases do not
speak of large numbers unless they are compelled to use a mode of writing
in terms of powers of the number ten. It is obvious that the definition of a
large number belongs in advanced knowledge: it is a number so large that
the probability of the second level for a sufficient convergence from that
number on is high enough. A “sufficient convergence” and a “high-enough
probability” are matters of definition and depend on what is attainable;
the large number is a function of these definitions, computable or appraisable
only within advanced knowledge (see 12, § 62).
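How a "large number" depends on the two definitions can be shown by a simplified computation. The sketch below rests on assumptions not made in the text: the sequence is of the Bernoulli type with a known probability p, and "sufficient convergence" is checked only at the number n itself, not for all later numbers. The function names and the chosen values of δ and of the required probability level are hypothetical.

```python
import math


def prob_within(n, p, delta):
    """Exact binomial probability that the frequency m/n lies within p +/- delta."""
    total = 0.0
    for m in range(n + 1):
        if abs(m / n - p) <= delta:
            total += math.comb(n, m) * p ** m * (1.0 - p) ** (n - m)
    return total


def large_number(p, delta, level, n_max=1000):
    """Smallest n <= n_max at which prob_within first reaches `level`, else None."""
    for n in range(1, n_max + 1):
        if prob_within(n, p, delta) >= level:
            return n
    return None


n_coarse = large_number(0.5, 0.10, 0.95)  # wide interval of exactness
n_fine = large_number(0.5, 0.05, 0.95)    # narrower interval, same level
print(n_coarse, n_fine)
```

Narrowing the interval δ or raising the required level makes the resulting number larger, which is why the notion of a large number varies so much from one field to another.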

A large number is not always necessary for an inductive inference. Some¬ 
times the inference can be based on only one instance. Such inferences occur 
when previous inductions have created a situation in which one experiment, 
called a crucial experiment, is sufficient for an induction. Instances of crucial
experiments are found in what a physician calls differential diagnosis. One
Wassermann test may be regarded as sufficient evidence of a case of syphilis,
because previous inductions have established a relation between the test
and the disease and have shown that repeated applications of the same test 
to the same person usually lead to the same result. The existence of crucial 
experiments has been misconstrued as evidence against an inductive interpre¬ 
tation of scientific methods, but it can be incorporated without difficulty in 
an interpretation for which all inductive inferences are reducible to induction 
by enumeration. 

A further condition that can be satisfied only in advanced knowledge is 
the condition of a homogeneous reference class. A class of tuberculosis cases 
is a homogeneous class. But it would seem unwise to compile death statistics 
in a class of persons with different diseases or in a class including both human 
beings and animals. The definition of the predicate “homogeneous” depends 
on the state of our knowledge. An inhomogeneous class can be defined as a 
class for which we know methods by means of which the class can be so 
subdivided, without the use of the attribute considered (see § 30), that sub¬ 
classes of different frequencies for the attribute result. The subdivision will 
sometimes be achievable by reference to other attributes, such as are given 
in the example of different diseases, or biological species. However, it can 
be achieved also by dividing the total sequence, or ordered class, into con¬ 
secutive sections. The latter method amounts to a determination of the dis¬ 
persion (see § 52). For an inhomogeneous class the probability of the second 
level concerning the persistence of the observed frequency is lower than for 
a homogeneous class. It is obvious that such considerations are restricted to 
advanced knowledge. 
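The test of homogeneity by consecutive sections can be sketched as follows. This is one common way of measuring dispersion, offered as an assumption and not as the specific procedure of § 52: the variance of the section frequencies is divided by the variance that a Bernoulli sequence of the same overall frequency would be expected to show. The function name and the simulated sequences are hypothetical.

```python
import random


def dispersion_ratio(seq, k):
    """Observed variance of section frequencies over the Bernoulli expectation.

    The sequence is cut into k consecutive sections of equal length; a ratio
    near 1 suggests a homogeneous sequence, a ratio far above 1 suggests not.
    """
    size = len(seq) // k
    used = seq[: size * k]
    f = sum(used) / len(used)
    freqs = [sum(used[i * size:(i + 1) * size]) / size for i in range(k)]
    observed = sum((x - f) ** 2 for x in freqs) / k
    expected = f * (1.0 - f) / size
    return observed / expected


rng = random.Random(0)  # fixed seed so that the illustration is reproducible
# Simulated homogeneous class: the same probability throughout.
homogeneous = [rng.random() < 0.5 for _ in range(4000)]
# Simulated inhomogeneous class: two subclasses with different frequencies.
mixed = ([rng.random() < 0.2 for _ in range(2000)]
         + [rng.random() < 0.8 for _ in range(2000)])

print(dispersion_ratio(homogeneous, 20), dispersion_ratio(mixed, 20))
```

For the mixed sequence the ratio is far above 1, which indicates that the class can be subdivided into sections of different frequencies without using the attribute itself.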

The relevance of the fair-sample condition, too, can be demonstrated in 
advanced knowledge. If the observed initial section presenting the frequency 
$f$ is the result of random methods, the formulas (5 and 9, § 62) can be applied,
which lead to a high degree of the inverse probability that $p = f \pm \delta$. But
if the sample is selected so as to contain a certain attribute by preference,
the procedure that supplies it is not the same as the one that determines
further observations and is thus not represented by the probability function
$w_n(p;f)$ of the formulas mentioned. Therefore, Laplace's convergence formula
for the inverse probability cannot be applied. This explanation makes it clear
why the fair-sample condition is attached to inductive inferences only when 
they belong in advanced knowledge; for induction in primitive knowledge 
there is no fair-sample condition. 

In advanced knowledge the inference of induction by enumeration is justi¬ 
fied by the theorems of the calculus of probability. Statements about the limit 
of the frequency and about the reliability of the inductive conclusion can be
given a probability meaning (§74); in other words, within the frame of 
advanced knowledge the inductive inference is an appropriate instrument for 
operations controlled by a logic of probability. There is no question of the 
legitimacy of induction in advanced knowledge. 

It is in advanced knowledge, too, that such formulas as the rule of succession 
(22, § 62) find their places. The equality of antecedent probabilities on which 
the applicability of the rule depends must be known before the rule can be 
used. The argument that equality may be assumed in the absence of knowledge 
to the contrary makes use of the principle of indifference, but it was shown 
(§ 68) that the principle is untenable. Logic cannot supply a probability 
metric; only experience and observation can inform us about degrees of 
probability or about the equality of such degrees. But even experience and 
observation can supply such knowledge only if the observational results are 
linked and carried on by inductive inferences. There is no circularity involved 
in such a procedure if it is used in establishing special forms of inductive 
inferences; and the rule of succession, therefore, occupies a legitimate position 
in advanced knowledge. But it would be circular to base the general use of 
the inductive inference on formulas that presuppose the knowledge of a 
probability metric. The ultimate justification of induction must be given by 
other means than formulas that are derivable in the calculus of probability; 
the problem falls entirely within the province of primitive knowledge. 
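The dependence of the rule of succession on the antecedent probabilities can be made explicit. The sketch assumes the familiar Laplacean form of the rule, (m + 1)/(n + 2), and generalizes it to a Beta-shaped antecedent density with parameters a and b; this parametrization and the function name are assumptions introduced for illustration and are not taken from § 62.

```python
from fractions import Fraction


def predictive(m, n, a=1, b=1):
    """Probability of a further positive case after m positive cases among n.

    Assumes a Beta(a, b) antecedent density; a = b = 1 is the case of equal
    antecedent probabilities and yields Laplace's value (m + 1)/(n + 2).
    """
    return Fraction(m + a, n + a + b)


equal_antecedents = predictive(9, 10)            # equal antecedent probabilities
unequal_antecedents = predictive(9, 10, a=1, b=5)  # antecedents favoring low p
print(equal_antecedents, unequal_antecedents)
```

The same observations lead to different values under different antecedent probabilities, so the rule can be applied only where the equality of the antecedent probabilities is itself known.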

§ 87. The Rule of Induction 

We turn now to the consideration of induction in primitive knowledge. So 
long as no probabilities have been established, the inductive rule cannot be 
based on theorems of the calculus of probability; therefore we cannot prove 
that the inductive rule leads to the posit of greatest weight, nor do we know 
how probable it is that the limit posited will be reached. We cannot even 
prove that the posit becomes better with a greater number of observed 
instances. In spite of our ignorance, however, we must use the inductive rule,
since otherwise we could not establish any probability values and could never 
proceed to the advanced state of knowledge in which the theorems of prob¬ 
ability take over the functions of a guide in inductive method. 

To facilitate the discussion of induction in primitive knowledge, or primitive 
induction, we shall proceed by steps. We shall not begin with the analysis of 
a state in which nothing is known about the progress of sequences, but shall 
leave the discussion of that question to a later inquiry (§ 91). Rather, we 
shall introduce the assumption that the sequences under consideration have 
a limit of the frequency, although the limit is unknown. Let us see to what 
extent this assumption can help in the solution of the inductive problem. 

Fig. 28. Frequency curve of a sequence converging to a limit. 

Again the concept of posit will be used for the interpretation of the statements 
to be considered. The statement that the observed frequency will 
persist can be maintained only in the sense of a posit, since it is obvious that 
we cannot prove it to be true. But it is not an appraised posit, since we have 
no weight for it. In what sense, then, can the inductive posit be justified if 
we have no proof that the posit will lead to the greatest number of successes? 

To answer the question, we must analyze the way in which the rule of 
induction is used. The inductive posit is not meant to be a final posit. We 
have the possibility of correcting a first posit, of replacing it by a new one 
when new observations have led to different results. From this point of 
view, the following analysis of the inductive procedure can be made. If the 
sequence has a limit of the frequency, there must exist an n such that from 
there on the frequency $f^i$ ($i > n$) will remain within the interval $f^n \pm \delta$, 
where $\delta$ is a quantity that we can choose as small as we like, but that, once 
chosen, is kept constant. Now if we posit that the frequency $f^i$ will remain 
within the interval $f^n \pm \delta$, and if we correct this posit for greater n by the 
same rule, we must finally come to the correct result. The inductive procedure, 
therefore, represents a method of anticipation; in applying the inductive rule 
we anticipate a result that for iterated procedure must finally be reached 
in a finite number of steps. We thus speak here of an anticipative posit. In 
contradistinction to an appraised posit, the weight of which is known, it may 
also be called a blind posit, since it is used without a knowledge of how good 
it is; the term “blind” is meant to express the fact that it is a posit without 
a rating. 

Figure 28 will make the method clear. The abscissa is given by the number 
n of the elements of the sequence; as ordinates, the relative frequencies $f^n$ 
are plotted. When the sequence has a limit of the frequency, its oscillations 
will die down. If the sequence is known only to place 1, we posit the 
corresponding $f^{n_1}$. When we continue the observation of the sequence, the next 
posit may be made at place 2, then at places 3 and 4, and so on. At each place 
we use the frequency observed as the best posit. We see that, in going from 
place 1 to place 2, we even make our posit worse; but finally we must reach 
a place (in the diagram, place 4) where the posit is correct within the interval 
$2\delta$ and will remain so for the rest of the sequence. The inductive procedure, 
therefore, has the character of a method of trial and error so devised that, for 
sequences having a limit of the frequency, it will automatically lead to success 
in a finite number of steps. It may be called a self-corrective method,¹ or an 
asymptotic method. 
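The trial-and-error procedure just described can be traced numerically. The following is only a minimal sketch under stated assumptions, not part of the original exposition: the sequence is a hypothetical one in which every third element has the attribute (so that the frequency converges to 1/3), the tolerance 1/20 is chosen arbitrarily, and the names `running_frequencies` and `settling_place` are illustrative. Since only a finite section can be inspected, the place reported is correct only relative to the section examined.

```python
from fractions import Fraction

def running_frequencies(sequence):
    """Return the relative frequencies f^1, f^2, ... of the attribute (True)."""
    freqs = []
    hits = 0
    for n, has_attribute in enumerate(sequence, start=1):
        hits += 1 if has_attribute else 0
        freqs.append(Fraction(hits, n))
    return freqs

def settling_place(freqs, delta):
    """Smallest place n such that every later observed frequency stays
    within f^n +/- delta. Only the finite section given is inspected."""
    for n, f_n in enumerate(freqs, start=1):
        if all(abs(f_i - f_n) <= delta for f_i in freqs[n - 1:]):
            return n
    return len(freqs)  # reached only for an empty list

# Hypothetical sequence: every third element has the attribute (limit 1/3).
sequence = [(i % 3 == 0) for i in range(1, 301)]
freqs = running_frequencies(sequence)
place = settling_place(freqs, Fraction(1, 20))
print("posit stays within the interval from place", place, "with f^n =", freqs[place - 1])
```

The early posits (here 0 at places 1 and 2) are bad ones; the procedure corrects them simply by repeating the same rule at each later place.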

The method of the anticipative posit may be formulated as follows: 

Rule of induction. If an initial section of n elements of a sequence is 
given, resulting in the frequency $f^n$, and if, furthermore, nothing is known about 
the probability of the second level for the occurrence of a certain limit p, we posit 
that the frequency $f^i$ ($i > n$) will approach a limit p within $f^n \pm \delta$ when the 
sequence is continued. 

The distinction between appraised and anticipative posits leads to two 
different kinds of posit. It is a common feature of both that their use is not 
justified for the individual case, but only in repeated applications. With 
respect to the grounds of their use, however, the two posits must be 
distinguished. The appraised posit is justified by the principle of the greatest 
number of successes. This kind of posit, therefore, can be used only when the 
corresponding weight is known. The anticipative posit cannot be justified 
by a maximum principle. It involves another form of justification based on 
the principle of finite attainability. If the sequence has a limit, the anticipative 
posit is justified because, in repeated applications, it leads to any desired 
approximation of the value of the limit in a finite number of steps. 

¹ The self-corrective nature of induction was emphasized by C. S. Peirce, who mentioned 
“the constant tendency of the inductive process to correct itself,” in Collected Papers (1878; 
Cambridge, Mass., 1932), Vol. II, p. 456; see also ibid., p. 501, and Vol. V, p. 90. I have not 
been able, however, to find a passage in Peirce's work where he clearly states a reason for 
his contention. The fact that he constantly connects the problem of induction with that of a 
fair sample, that is, with the use of random sequences, seems to indicate that he bases the 
self-corrective nature of induction on Bernoulli’s theorem. This interpretation is supported 
by his exposition of the increase in the reliability of induction (ibid., Vol. II, p. 428). Such 
an argument is invalid, of course, since the justification of induction must be given before 
the use of probability considerations. As to my own relations to Peirce, with whose ideas I 
was not acquainted when I wrote the German original of this book, see my remarks in The 
Philosophy of John Dewey (ed. by P. Schilpp; Evanston, Ill., 1939), pp. 188-190. 

This argument may be called an asymptotic justification. It includes an 
explanation why the value $f^n$ found for the last observed element of the 
sequence is preferable to any earlier value. If after 100 elements we find 
one value of $f^n$ and after 200 elements another, we do not claim that the 
later value is better than the earlier one in the sense that it is more probable. 
Such a proof is impossible in 
primitive knowledge and can be given only in advanced knowledge (see § 86). 
But if the procedure of going through all elements successively can be justified, 
we know, at least, that in selecting the $f^n$ of the later element we are closer 
to the end of the procedure. The choice of the last $f^n$ is therefore a matter 
of economy. 

The posit $f^n$ is not the only form of anticipative posit. We could also use 
a posit of the form 

$$f^n + c_n \qquad (1)$$

where $c_n$ is an arbitrary function, which is so chosen that it converges to 0 
with n increasing to infinite values. All posits of this form will converge 
asymptotically toward the same value, though they will differ for small n. 
We shall prefer the inductive posit $f^n$, for which $c_n = 0$. To do so we can, 
however, adduce only grounds of descriptive simplicity;² that is, the inductive 
posit is simpler to handle. 
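That posits of the form (1) differ for small n but agree asymptotically can be checked on a small example. The sketch below rests on assumptions that are not in the text: the observed counts are hypothetical (every second element has the attribute, so the frequency tends to 1/2), and the three functions $c_n$ are arbitrary choices that merely satisfy the condition of converging to 0. Note that a posit with $c_n \neq 0$ may even fall outside the interval from 0 to 1 for small n.

```python
from fractions import Fraction

def posits(hits_by_n, c):
    """For each n, return the posit f^n + c_n, where f^n = hits / n."""
    return [Fraction(m, n) + c(n) for n, m in enumerate(hits_by_n, start=1)]

# Hypothetical counts: every second element has the attribute, so f^n -> 1/2.
hits_by_n = [n // 2 for n in range(1, 2001)]

inductive = posits(hits_by_n, lambda n: Fraction(0))          # c_n = 0
shifted = posits(hits_by_n, lambda n: Fraction(1, n))         # c_n = 1/n
damped = posits(hits_by_n, lambda n: Fraction(-3, n + 10))    # c_n = -3/(n+10)

limit = Fraction(1, 2)
early_gap = abs(inductive[0] - shifted[0])                    # disagreement at n = 1
late_gaps = [abs(p[-1] - limit) for p in (inductive, shifted, damped)]
print("gap at n = 1:", early_gap)
print("distances from the limit at n = 2000:", late_gaps)
```

Nothing in this comparison singles out $c_n = 0$ as better; it is only the simplest choice to handle.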

Whereas the principle of asymptotic convergence determines a class of 
inductive rules as equally justified, a distinction between the members of 
this class, that is, the selection of a particular function $c_n$, can be achieved in 
advanced knowledge. For instance, the method of cross induction (§ 84) can be 
regarded as an instrument for finding a function $c_n$ such that the value $f^n + c_n$ 
supplies an earlier convergence within an interval $\delta$ than does the value $f^n$. 
This method and others will be discussed in §§ 88-90. 

These results must now be extended to the concept of practical limit, 
which was introduced in § 66. The concept refers to a sequence which reaches 
sufficient convergence after a fairly large number of elements, but which 
may diverge in later parts that lie beyond the reach of human experience. 
It is obvious that the rule of induction is justified, too, when the condition 
of the limit is replaced by that of a practical limit. The justification, in fact, 
will be improved, since finite attainability then means an attainability for 
human capacities. A sequence that converges so late that human observers 
cannot experience the convergence has, for all practical purposes, the 
character of a sequence without a limit. In the following discussions we should 
therefore regard the limit condition as referring to a practical limit. Since it 
seems unnecessary to mention this interpretation on all occasions, we shall 
speak throughout simply of the limit condition. 

² Descriptive simplicity is a property of a description that has no bearing upon its truth. 
It must be distinguished from inductive simplicity, which classifies descriptions leading to 
different predictions. See EP, § 42. In the same book, on p. 355, I tried to give other reasons 
for preferring the posit $f^n$. Dr. Norman Dalkey has since convinced me that they are invalid. 
It is, however, sufficient for the theory of induction that the posit $f^n$ is descriptively simpler. 

When we use the logical conception of probability, the rule of induction 
must be regarded as a rule of derivation, belonging in the metalanguage. 
The rule enables us to go from given statements about frequencies in observed 
initial sections to statements about the limit of the frequency for the whole 
sequence. It is comparable to the rule of inference of deductive logic (see § 5), 
but differs from it in that the conclusion is not tautologically implied by the 
premises. The inductive inference, therefore, leads to something new; it is 
not empty, like the deductive inference, but supplies an addition to the 
content of knowledge. It is a consequence of the synthetic nature of inductive 
inference that the conclusion cannot be asserted as true, but can be asserted 
only in the sense of a posit. Since we could show that all inductive inferences 
are reducible to induction by enumeration, in the wider sense of statistical 
induction (§§ 67, 84), the rule of induction is the only rule of derivation that 
distinguishes inductive logic from deductive logic. In other words, inductive 
logic contains all the rules of derivation of deductive logic with the addition 
of the inductive rule. 

A so-called paradox of inductive logic was constructed by N. Goodman.³ 
He assumes that an initial section of $n_0$ elements of a sequence has been 
observed and that all elements observed have the attribute B. He now defines 
an attribute C as follows: an element has the attribute C if it is one of the 
first $n_0$ elements or has the attribute $\bar{B}$ (non-B). It is obvious that all the $n_0$ elements 
of the initial section have the property C; the rule of induction therefore 
advises us to expect C for the next following element $n_0 + 1$. But since this 
element does not satisfy the first part of the disjunctive property C, it must 
satisfy the second part, and we have inferred by induction that the next 
element will have the attribute $\bar{B}$. 

To regard this consideration as an objection against the rule of induction 
would reveal a misunderstanding of inductive method. The rule of induction, 
applied in primitive knowledge, leads only to posits that are justified 
asymptotically. We cannot expect it to supply correct predictions for every individual 
element. This justification includes the case of the property C. Assume the 
total infinite sequence to consist of elements B; then, applying the rule of 
induction to the property C, we shall first make bad posits, but while going 
on will soon discover that the following elements do not have the property C. 
We shall thus turn to positing $\bar{C}$ and have success. When we count the 
frequency m for $\bar{B}$ and the number n of elements in such a way that we begin 
after the first $n_0$ elements, the paradoxical inference can be regarded as a rule 
of induction that posits the frequency of $\bar{B}$ as given by 

$$\frac{m + n_0}{n + n_0}$$

Since this value converges with growing n to the value $\frac{m}{n}$, it is included in the class of 
justifiable rules of induction. The inferiority of this particular rule in the 
example considered cannot be demonstrated in primitive knowledge; in fact, 
if the sequence had nothing but elements $\bar{B}$ after the $n_0$th element, the rule 
would be superior to the usual rule. 

³ “The Problem of Counterfactual Conditionals,” in Jour. of Philos., Vol. 44 (1947), p. 128. 
I present the paradox in a somewhat changed version in order to make it applicable to the 
rule of induction. In the form presented by Goodman it refers to aprioristic inferences that 
are illegitimate anyway. 
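The claim that the paradoxical rule belongs to the class of asymptotically justified rules can be illustrated by comparing its posit with the ordinary frequency. The following sketch is an illustration only: m is taken to count those of the n later elements that satisfy the second part of the disjunctive property C, the value $n_0 = 50$ and the assumed proportion of one in four are hypothetical, and the function names are not from the text.

```python
from fractions import Fraction

def usual_posit(m, n):
    """Observed frequency m/n, counted after the first n0 elements."""
    return Fraction(m, n)

def disjunctive_posit(m, n, n0):
    """Frequency of the disjunctive property C among all n + n0 elements."""
    return Fraction(m + n0, n + n0)

n0 = 50  # hypothetical length of the initial section
gaps = []
for n in (4, 40, 400, 4000, 40000):
    m = n // 4  # hypothetical: one in four later elements counts toward m
    gaps.append(disjunctive_posit(m, n, n0) - usual_posit(m, n))

for n, gap in zip((4, 40, 400, 4000, 40000), gaps):
    print(n, float(gap))
```

The difference between the two posits equals $n_0 (n - m) / (n (n + n_0))$, which shrinks toward 0 as n grows with $n_0$ held fixed; the two rules therefore differ only in the term $c_n$ of formula (1).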

It is only in advanced knowledge that the rule can be criticized; and in 
advanced knowledge the inference can be shown to be inferior because it 
violates the rule, “Use the narrowest common reference class available”. 
The property C, by its definition, is identical with $\bar{B}$ from the $(n_0 + 1)$-st 
element; since the reference class $\bar{B}$ is narrower than C, it should be used 
as a basis for the inference (in other words, the property with respect to 
which the first $n_0$ elements should be counted is the property $\bar{B}$). Using the 
property C as reference class means determining the probability of the next 
element with respect to an initial section that is not specified, since every 
initial section of $n_0$ elements has the property C. 

But the rule of using the narrowest reference class for which there exist 
reliable statistics belongs in advanced knowledge. It can be applied only 
when it is known what statistics can be called reliable. If ten cases of cholera 
are observed, eight of which have a lethal issue, we assign a death probability 
of 80% to cases of cholera, but would refuse to assign this death probability 
to cases of disease in general. That we are entitled to prefer the narrower class 
of cholera cases to the wider class of disease cases is justified by the large 
amount of statistical material, but it cannot be justified a priori. Cholera 
might be a sort of disease much less dangerous than the average disease, 
and the eight observed death cases might be a misleading exception. 

The choice $c_n = 0$ in (1) cannot be shown to be better than any other 
choice. The illustration shows that even the rule of putting $c_n = 0$ does not 
always lead to the same conclusion. It leads to contradictory conclusions 
according as it is applied to the property B or C. This fact offers no difficulties, 
because inductive conclusions are never final, but are used only as posits and 
are canceled when other conclusions lead to their abandonment. The choice 
between contradictory conclusions is made in terms of additional rules, which 
are developed in advanced knowledge. Another instance of this kind was 
studied in the method of cross induction (§ 84): the rule of induction is first 
applied in the horizontal, then in the vertical, direction, and the secondary 
conclusions may lead to the cancellation of some of the primary conclusions. 
Additional evidence, though not contradictory to the original evidence, may 
change the conclusion, a result unthinkable for deductive logic. With respect 
to the requirement of consistency, inductive logic differs intrinsically from 
deductive logic; it is consistent, not de facto, but de faciendo, that is, not in 
its actual status, but in a form to be made. 

Some remarks may be added in order to clarify the nature of the rule of 
induction as a rule of primitive knowledge. The rule is an instrument for 
finding probabilities, but it does not express a probability inference. If an 
event B has been observed with respect to the reference class A in m out of n 
cases, the rule advises us to expect the event B, when A occurs, with the 
probability $\frac{m}{n} \pm \delta$; but it does not say that the grounds on which the advice 
is based confer a probability on B. The rule says: 

I. If an initial section $F^n$ is observed, posit 

$$P(A,B) = \frac{m}{n} \pm \delta \qquad (2)$$

But it does not state a probability relation between the observations $F^n$ 
and the event B; that is, it does not say: 

II. Posit that 

$$P(A.F^n,B) = \frac{m}{n} \pm \delta \qquad (3)$$

The two formulations are not equivalent. In formulation I the observed 
frequency is regarded as the ground for the assertion of the posit and therefore 
is included in the content of the rule. In formulation II the observed frequency 
is included in the content, not of the rule, but of the posit, since it is regarded 
as determining the reference class of the probability expression. As the rule 
is one linguistic level higher than the posit, the ground-for-assertion relation 
is formulated in the metalanguage of the language to which the posit, or 
probability expression, belongs; the ground-for-assertion relation is thus one 
linguistic level higher than the probability relation. 

Now it can easily be seen that only formulation I is permissible, since only 
this formulation is capable of a justification. To regard the observed 
frequency as a ground for the assertion of the probability (2) is justified because 
the latter can be asserted in the sense of an anticipative posit; but we are not 
entitled to assume that the observed frequency makes the event B probable 
to a certain degree. That formulation II may lead to mistakes can be 
illustrated by the use of sequences with aftereffect: for such sequences (2) would 
be correct, whereas (3) would be false. 

That the two formulations are not both legitimate distinguishes inductive 
from deductive inference. For deductive logic the two corresponding 
formulations are: 

III. If a is true, and $a \supset b$ is true, assert b. 

IV. Assert: $a.(a \supset b) \supset b$. 

They differ, of course, as to content: formulation III expresses the rule of 
inference; formulation IV expresses the existence of a corresponding implication 
in the object language. But both are justifiable, which means here that 
both lead to true statements. The deductive rule of inference, therefore, has 
an object-language correlate in an implication. The inductive rule of inference, 
on the contrary, does not possess an object-language correlate. 

The decision for formulation II as the rule of induction derives from the 
confusion of two relations: the ground-for-assertion relation and the probability 
relation. When we say that the observed frequency is the ground for the 
assertion of the probability (2), we state a relation between our knowledge 
and a prediction, to be formulated in the metalanguage. This relation, the 
ground-for-assertion relation, is not a relation of degree or of order. It selects 
a certain statement as assertable, but it does not include any quantitative 
measure of assertability. 

The absence of degrees of assertability is illustrated by the fact that the 
asymptotic convergence, which justifies the rule of induction, holds likewise 
for the class (1) of rules; the ground-for-assertion relation determines, 
therefore, a class of different posits as assertable. In advanced knowledge these 
posits would have different weights, but in primitive knowledge they are not 
ordered by degrees. The ground-for-assertion relation cannot assign ratings 
to posits; it merely gives a permit of assertability. 

For these reasons, formulation I is the only admissible form for the rule of 
induction. The justification of the rule in this form has so far been given on 
the assumption of the existence of a limit of the frequency. In a later 
investigation (§ 91) we shall free ourselves from this last presupposition. 

§ 88. Anticipative Posits in Advanced Knowledge 

The method of anticipative posits, although it is the natural instrument of 
primitive knowledge, can be applied also in advanced knowledge. Instances 
of such application are found in the theory of statistical inference, used for 
the hypothetical introduction of a probability metric (see § 70). These 
inferences can be regarded, on the one hand, as methods of discrimination 
between posits of the class (1, § 87) and thus of determining a function $c_n$ 
that improves the convergence. On the other hand, these methods again 
apply the principle of anticipative posits. 

If a large statistical material is given, it can be used for a direct determination 
of the distribution (see § 42); for this determination, statistical induction 
is applied for every interval du of the attribute u. If the material is insufficient, 
however, the method, applied to small intervals du, would lead to unreasonable 
results; the distribution would be an irregular zigzag curve, the shape of 
which would not indicate the form to which it would converge asymptotically 
with increased material. A better anticipation of the final curve for incomplete 
statistical material is afforded by the methods of statistical hypotheses, 
developed by Fisher, Neyman, and others. These methods can be regarded 
as determining for every interval du a function $c_n$ such that $f^n + c_n$, rather 
than $f^n$, represents the best posit for the limit of the frequency. 

The way in which this improvement is achieved must be studied more 
closely. The principle of maximum likelihood, developed by Fisher, may 
serve as the basis of the analysis. Formula (2, § 70) determines the inverse 
probability $v_n(s)$ of a value s of the parameter of the distribution d(s;u). 
I shall make the assumption that the function $L_n(s)$ of this formula has the 
Bernoulli properties (10, § 57): 

$$k_n(s_1,s_2) = \int_{s_1}^{s_2} L_n(s)\,ds$$

$$\lim_{n \to \infty} k_n(s_1,s_2) = k \quad \text{for } s_1 \le s_0 \le s_2 \qquad (1)$$

$$\lim_{n \to \infty} k_n(s_1,s_2) = 0 \quad \text{for } s_0 < s_1 \text{ or } s_0 > s_2$$

The critical point is the value $s_0$ for which, in all intervals $du_i$, the limit 
of the frequency $\frac{n_i}{n}$ coincides with the probability, that is, $s_0$ is determined 
for finite n by the relation 

$$\frac{n_i}{n} \approx d(s_0;u_i)\,du_i \qquad (2)$$

The sign $\approx$ denotes approximate equality; for increasing n this approximate 
equality improves, if the assumption that the distribution conforms to 
d(s;u) is correct. The convergence toward k stated in (1) holds for every 
value du, however small, but du is kept constant during the transition to the 
limit. The Bernoulli properties make it possible to apply the inferences used 
for (11, §62) and to show that the maximum of the inverse probability 
converges asymptotically with the maximum likelihood. If nothing is known 
about the antecedent probability q(s), the principle of maximum likelihood 
can therefore be justified as a method of asymptotic convergence, in the same 
way that the rule of induction is justified. It leads to anticipative posits, 
which must eventually lead to the correct result if this result is attainable. 

The improvement of induction through a term $c_n$ is here achieved by 
means of an asymptotically convergent method, which does not presuppose 
a knowledge of antecedent probabilities. In this sense, Fisher's method has, 
in fact, attained its aim of eliminating the prerequisite of antecedent 
probabilities. This is possible, however, only if the Bernoulli conditions (1) are 
satisfied; the last of these conditions, in particular, is a rather strong 
condition, since it virtually eliminates all but one hypothesis. These conditions 
presuppose the applicability of the special theorem of multiplication, or some 
other rule for combinations, and must be proved for the function d(s;u) under 
consideration, which determines the dependence of L on s. 
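The concentration required by the Bernoulli conditions can be made visible in the simplest case. The sketch below is an illustration under assumptions that go beyond the text: the hypotheses form a one-parameter class in which s is the probability of an attribute in independent trials, so that the likelihood of m occurrences in n trials is proportional to $s^m (1-s)^{n-m}$; the likelihood is normalized so that its total over the parameter range is 1 (the constant k of the text is thus taken as 1); the data are hypothetical, with an observed frequency of exactly 0.3; and the integral is approximated by a midpoint sum. The function names are illustrative.

```python
import math

def log_likelihood(s, m, n):
    """Logarithm of s^m (1-s)^(n-m), the likelihood up to a constant factor."""
    return m * math.log(s) + (n - m) * math.log(1.0 - s)

def grid_points(grid=2000):
    """Midpoints of `grid` equal cells of the open interval (0, 1)."""
    return [(j + 0.5) / grid for j in range(grid)]

def likelihood_mass(m, n, s1, s2):
    """Share of the normalized likelihood lying in [s1, s2] (midpoint sum)."""
    points = grid_points()
    logs = [log_likelihood(s, m, n) for s in points]
    top = max(logs)  # subtract the maximum to avoid numerical underflow
    weights = [math.exp(v - top) for v in logs]
    inside = sum(w for s, w in zip(points, weights) if s1 <= s <= s2)
    return inside / sum(weights)

def maximum_likelihood_value(m, n):
    """Grid point at which the likelihood is greatest."""
    return max(grid_points(), key=lambda s: log_likelihood(s, m, n))

# Hypothetical data with observed frequency 0.3 at every stage.
masses = [likelihood_mass(3 * n // 10, n, 0.25, 0.35) for n in (10, 100, 1000)]
best = maximum_likelihood_value(300, 1000)
print("likelihood share inside [0.25, 0.35]:", masses)
print("maximum-likelihood value for n = 1000:", best)
```

In this special case the maximum-likelihood value coincides, up to the grid resolution, with the observed frequency, and the share of the likelihood outside any interval around it dwindles as n grows. Whether the corresponding conditions hold for another functional form d(s;u) must be shown separately.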

Furthermore, the method presupposes an inductive knowledge of another 
kind inasmuch as it presupposes the knowledge of the functional form d(s;u). 
This knowledge excludes a direct empirical determination of the distribution, 
as would obtain by the successive application of the rule of induction to all 
intervals du; for small numbers n the distribution thus resulting would not 
have the form d(s;u). For incomplete observational material the hypothetical 
form d(s;u) is preferred to the direct empirical distribution. Fisher’s principle 
can be regarded as an extension of the method of anticipative posits to cases 
in which previous knowledge has restricted the admissible hypotheses to a 
one-parameter class (or a class depending on more parameters); the rule of 
positing that selects among this class is then constructed in the same manner 
as the rule of induction. It is an interesting fact that the method of anticipative 
posits, originally the instrument of primitive knowledge, finds renewed 
application on an advanced level of knowledge. In this application it may be called 
a method of restricted anticipative posits. It is an improved version of statistical 
induction, supplying a set of simultaneous inductive posits connected by a 
restrictive condition. The posits refer to the various intervals du and are 
made in such a way that the individual posits are balanced against each other 
by means of a concatenation derived from previous knowledge. 

The bearing of previous knowledge upon anticipative posits may be 
illustrated by an example in which only one inductive posit is made. Assume it is 
known that the limit of the frequency of a certain sequence must be a 
multiple of 1/5, that is, one of the values 0, 1/5, 2/5, 3/5, 4/5, 1; then one would not posit 
the observed frequency to persist, but would select the nearest multiple of 1/5 
for the posit. In this example, too, advanced knowledge leads to a deviation 
from the rule of induction. 
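The restricted posit of this example amounts to a simple rounding rule. The sketch below assumes, as the six listed values indicate, that the admissible limits are the multiples of 1/5; the function name `restricted_posit` and the sample counts are hypothetical, and the treatment of an observed frequency lying exactly halfway between two admissible values (here the smaller value is chosen) is an arbitrary convention not discussed in the text.

```python
from fractions import Fraction

def restricted_posit(m, n, step=Fraction(1, 5)):
    """Posit the admissible value (a multiple of `step` in [0, 1])
    nearest to the observed frequency m/n."""
    observed = Fraction(m, n)
    candidates = [k * step for k in range(int(1 / step) + 1)]
    # min() keeps the first of equally near candidates, i.e. the smaller one
    return min(candidates, key=lambda value: abs(value - observed))

print(restricted_posit(37, 100))   # observed 0.37
print(restricted_posit(93, 100))   # observed 0.93
```

With increasing n the observed frequency settles near one admissible value, so the restricted posit and the unrestricted posit $f^n$ converge to the same limit; the restriction only improves the posits for smaller n.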

The restriction of posited hypotheses to a certain class holds likewise for 
statistical inferences other than those based on the method of maximum 
likelihood, for instance, for Neyman’s methods. All such methods, if they 
dispense with a knowledge of antecedent probabilities, are justifiable through 
asymptotic convergence. This includes the usual method of determining a 
normal curve through statistical ascertainment of mean and standard 
deviation (§ 43). The class of admissible hypotheses is here the two-parameter 
class of normal curves. The condition that the distribution must be known 
to belong to a certain class can be generalized in the following sense: if it is 
known that for the limiting case of an infinite observational material the 
distribution belongs to a certain class, the use of this class for an asymptotic 
convergence is justifiable. Such a generalization was studied by Wald. 

The justification of the method of anticipative posits in advanced knowledge 
differs from the one concerning primitive knowledge so far as it cannot be 
proved that the method must lead to success if success is attainable. It can 
be proved only that success is probable if it is attainable. The limitation to a 
probability originates from the merely probable character of the inductive 
assumptions entering into the method. It may be false that the distribution 
belongs to the class considered; then there might still exist a limiting 
distribution, but it would not be found through the method. In primitive 
knowledge the method of anticipative posits is free from this limitation. 

The methods of restricted anticipative posits share the property of the 
general method in determining, not one rule of positing, but a class of such 
rules, all equally justified. For instance, instead of selecting for the posit the 
value $s_0$ of the parameter s determined by the rule of maximum likelihood, 
it is permissible to select a value $s_0 + c_n$, where $c_n$ is a function converging 
to 0 with increasing n. It is impossible to distinguish one of these values as 
preferable to the others unless something is known about the antecedent 
probabilities. Without such knowledge the decision for a certain form of $c_n$, 
for instance, $c_n = 0$, must be regarded as a matter of convenience, as was 
explained with respect to the rule of induction. Although advanced knowledge 
leads to a restriction of functions $c_n$ as compared with primitive knowledge, 
it still leaves open a certain class if it is dependent on anticipative posits. 

The term likelihood was selected by Fisher in order to indicate that his 
principle does not determine an inverse probability; he believes that the 
selection of statistical hypotheses is governed by rules other than those of 
probability. I cannot agree with his conception. The only legitimate system for 
the control of hypothetical assumptions is the calculus of probability in 
combination with the frequency interpretation; it is the only predictive 
system that is justifiable. The logical function of the likelihood method is 
not to supply a second kind of probability but to supply a rule of assertability 
which, though differing from the usual rule, is still justifiable in terms of the 
calculus of probability. When the usual rule, “Assert the most probable 
hypothesis”, is inapplicable because of lack of knowledge of antecedent 
probabilities, the likelihood method replaces it by the rule, “Assert a 
hypothesis which, if continuously adjusted to an increasing statistical material, 
converges with the most probable hypothesis”. This rule is applicable when 
the forward probabilities have the asymptotical properties of Bernoulli 
probabilities. 

The asymptotical rule, however, cannot provide degrees of preference for 
inductive hypotheses; it merely distinguishes admissible series of hypotheses 
from inadmissible ones, without discrimination among the admissible 
hypotheses. A reasonable measure of preference, or expectation, would be given by 
the inverse probability. The likelihood cannot take its place, for the following 
reasons: 

1. The order of likelihood, in general, does not correspond to the order of 
inverse probability. If the inverse probability is known, the likelihood must 
therefore be disregarded. 

2. If the inverse probability is unknown, the order of likelihood does not 
supply an order of assertability; hypotheses of different degrees of likelihood, 
if they belong to a certain class, are assertable with equal rights. 
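The first of these reasons can be made concrete with two hypotheses. The numbers in the following sketch are hypothetical and chosen only for illustration: the antecedent probabilities 9/10 and 1/10 and the likelihoods 1/5 and 3/5 are assumptions, and the inverse probabilities are computed by the rule of Bayes.

```python
from fractions import Fraction

def inverse_probabilities(antecedent, likelihoods):
    """Rule of Bayes: each inverse probability is the product of antecedent
    probability and likelihood, divided by the sum of all such products."""
    joint = [p * l for p, l in zip(antecedent, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

antecedent = [Fraction(9, 10), Fraction(1, 10)]   # hypothetical values
likelihoods = [Fraction(1, 5), Fraction(3, 5)]    # hypothetical values
inverse = inverse_probabilities(antecedent, likelihoods)
print("likelihoods:          ", likelihoods)
print("inverse probabilities:", inverse)
```

The second hypothesis has the greater likelihood, yet the first has the greater inverse probability; when the antecedent probabilities are known, the order of likelihood is therefore no guide to preference.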

It was explained in § 87 that the use of the inductive rule in primitive 
knowledge is not based on a probability relation but on the ground-for-assertion 
relation. A similar result holds for anticipative posits in advanced 
knowledge. The method of maximum likelihood, and with it all other 
asymptotical rules of statistical inference, are forms of the ground-for-assertion 
relation. They do not convey a probability, or a degree of assertability, to 
the assertion designated by them, but they offer legitimate grounds for the 
choice of these assertions. Asymptotic convergence cannot supply a graduated 
scale for the rating of hypotheses, but only a yes-or-no criterion. 

The term hypothesis is usually meant to imply a certain credibility of the 
conjecture. If this connotation is accepted, the method of maximum likelihood, 
applied without a knowledge of antecedent probabilities, supplies, not a 
hypothesis, but an assumption, the latter term being free from such 
connotations. The distinction between hypothesis and assumption then repeats, for 
a comprehensive thesis, the distinction between appraised and blind posit. 
The establishment of hypotheses is left to the inference by indirect evidence, 
which supplies a degree of probability, or an estimate of it, to a hypothesis. 
Anticipative posits, even in advanced knowledge, are forms of statistical 
induction and can establish assumptions only. In application to cases of 
unknown antecedent probabilities, Fisher’s principle must therefore not be 
classified as supplying hypothetical probabilities, i.e., as an instance of the 
third method mentioned in § 70, but as an improved version of statistical 
induction, i.e., of the first method. 

Incidentally, practical applications that at first sight appear to have the
form of the anticipative method (including, presumably, most actual
applications of the theory of statistical estimation) will often turn out, on
closer examination, to be inferences by indirect evidence. A very inexact
knowledge of antecedent probabilities may be sufficient to transform the
anticipative method into an inference by indirect evidence, and thus into a
method that supplies a probability estimate for a hypothesis. The situation
resembles the one holding for the inductive inference by enumeration, which in
most practical cases includes advanced knowledge and therefore admits of an
estimate of its reliability (see § 80).

The method of maximum likelihood has recently found an application in
one of the attempts to construct a theory of confirmation of hypotheses by
deductive means. This theory, which was developed by O. Helmer, C. Hempel,
and P. Oppenheim,¹ concerns propositions that are built up in terms of a
finite number of one-place predicates $B_1 \ldots B_t$. Since the number of
possible combinations of such predicates in affirmative or negative form is
finite, the theory is concerned with finding a probability metric that
coordinates a degree of probability to each “cell” of a logical space, that
is, to each predicate combination $Q_i$, such as
$Q_i = B_1 . B_2 . B_3 . B_4 . B_5 \ldots B_t$. The number of individuals
$x_k$ to which the predicates apply need not be finite.

Other theories of confirmation, like R. Carnap’s theory,² introduce a metric
of this kind on a priori grounds, and it is difficult not to classify them along
with aprioristic methods like the principle of indifference. The theory of
Helmer-Hempel-Oppenheim claims to get around this difficulty by making
the metric dependent on observational material. They assume that certain
cells of the logical space are occupied, that it is known of a finite number of
individuals which predicates pertain to them, or that it is known, at least, 
whether some cells are or are not empty. They then determine the most 
suitable probability metric for their space by the principle of maximum 
likelihood. 

The latter method degenerates for this problem into a simpler form. The
probability distribution is not continuous, but consists of a finite set of
probabilities $p_1 \ldots p_t$. Consequently, a functional form $d(p)$ need
not be assumed, and the $p_1 \ldots p_t$ may be regarded as the parameters of
the distribution. The method then consists in determining the combination
$p_1 \ldots p_t$ that makes the observed occupation of cells the most probable
result. The only synthetic assumption entering into the method is the
assumption of independence, that is, the use of the special theorem of
multiplication, without which (or some equivalent) the method of maximum
likelihood cannot be carried through. This assumption restricts the
applicability of the method to advanced knowledge, if the probabilities used
in the theory have a frequency meaning. The assumption of independence is then
testable through observations in the logical space.
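
The degenerate form of the method can be illustrated by a minimal sketch. It
assumes independent observations, so that the likelihood of a metric is a
product over cells, and it uses the known result that the relative frequencies
of the cells then maximize this product. The cell counts and the two rival
metrics are hypothetical values chosen only for illustration.

```python
from math import log


def log_likelihood(probs, counts):
    """Log-probability of the observed cell counts under independence.

    The multinomial coefficient is omitted because it does not depend on
    the metric `probs`. Empty cells contribute nothing.
    """
    return sum(c * log(p) for p, c in zip(probs, counts) if c > 0)


def max_likelihood_metric(counts):
    """Relative cell frequencies, which maximize the likelihood."""
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical occupation of t = 4 cells by 20 observed individuals.
counts = [8, 6, 4, 2]
best = max_likelihood_metric(counts)

# Two arbitrary rival metrics on the same cells, for comparison.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.5, 0.3, 0.1, 0.1]

print(best)
print(log_likelihood(best, counts) > log_likelihood(uniform, counts))
print(log_likelihood(best, counts) > log_likelihood(skewed, counts))
```

The comparison shows only that the selected metric makes the observed
occupation most probable; it supplies no rating of that metric.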

1 O. Helmer and P. Oppenheim, “A Syntactical Definition of Probability and of Degree of
Confirmation,” in Jour. of Symbolic Logic, Vol. X (1945), p. 25. C. Hempel and P.
Oppenheim, “A Definition of Degree of Confirmation,” in Philos. of Science, Vol. XII (1945), p. 98.

2 “On Inductive Logic,” in Philos. of Science, Vol. XII (1945), p. 72.

Some numerical instances may illustrate the method. If only one predicate
B is considered and n observations have been made, m of which fall into the
class B, the probability of B is $m/n$, in accordance with the frequency
interpretation. Differences arise if the observational knowledge has a
disjunctive form. If it is known that out of six observations either two or
four fall into B, the theory provides the probability 1/2 for B. This result
seems rather strange: the direct application of the rule of induction would
supply the answer that either 1/3 or 2/3 should be taken for the probability.
The intermediate value 1/2 is the result of a different rule for the selection
of the anticipative posit, based on the method of maximum likelihood in
combination with the assumption of independence; but since the method cannot
adduce reasons for the preference of its posits, the inverse probability being
unknown, the value 1/2 is a blind posit in the same sense as the posit
“1/3 or 2/3”. The difference drops out when larger n are considered, since
then the theory shows two maxima of the likelihood to exist, which means a
disjunctive answer.³
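
The numerical instance can be checked by a short sketch. It assumes that the
likelihood of a disjunctive observation is the sum of the binomial
probabilities of the alternatives, computed under independence, and it locates
the maxima by a simple grid search. The function names and the grid resolution
are illustrative choices, not part of the theory discussed.

```python
from math import comb


def disjunctive_likelihood(p, n, counts):
    """Probability that the number of B-cases among n is one of `counts`."""
    return sum(comb(n, m) * p**m * (1 - p) ** (n - m) for m in counts)


def local_maxima(n, counts, steps=1000):
    """Grid points in (0, 1) at which the likelihood has a local maximum."""
    grid = [i / steps for i in range(steps + 1)]
    vals = [disjunctive_likelihood(p, n, counts) for p in grid]
    return [
        grid[i]
        for i in range(1, steps)
        if vals[i] > vals[i - 1] and vals[i] > vals[i + 1]
    ]


# Six observations, of which either two or four fall into B.
print(local_maxima(6, (2, 4)))

# The same proportions for a larger n: either 20 or 40 out of 60.
print(local_maxima(60, (20, 40)))
```

For six observations the search finds a single maximum at the intermediate
value; for sixty it finds two maxima, one near each of the values that the
rule of induction would posit.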

The discussion shows that the authors are mistaken if their theory is meant
to supply more than an asymptotic rule and could give answers beyond the
reach of the rule of induction. Like all other forms of statistical inference, the
theory is an extension of the method of anticipative positing to advanced
knowledge; it can be applied if a previous use of the rule of induction has
shown that the assumption of independence presupposed for it is true. It is
obvious that the theory cannot establish an inductive logic; it depends for 
its justification on the meaning of probability and the methods developed for 
the frequency interpretation. 

Theories like those of Carnap and Helmer-Hempel-Oppenheim may be
said to develop a deductive conception of probability, because they derive the
values of probabilities from observational material by deductive methods
alone. I should like to summarize my criticism of all such methods by
comparing them with the inductive conception of probability, expressed in the
frequency interpretation, for which the derivation of a probability value
presupposes the use of inductive inferences.

Given a sequence of events or other objects in which the frequency $m/n$
of a certain attribute converges toward a limit $p$, we can coordinate to it a
set of numbers $p_n$ such that

$$\lim_{n \to \infty} p_n = \lim_{n \to \infty} \frac{m}{n} = p \tag{2}$$

We require that the $p_n$ be defined without a knowledge of the value $p$; the
convergence then must be demonstrable from the functional form of the $p_n$
on the assumption that the limit $p$ exists. The interpretation of such
asymptotic indices $p_n$ is the subject of the controversy.

3 In order to secure this result for all cases of this type, the theory must be corrected so as
to admit likelihood maxima of different heights as assertable metrics.

Asymptotic indices, which are derived from observational material by
deductive methods, are an indispensable instrument for both the deductive
and the inductive conceptions. In the deductive conception the $p_n$ are called
probabilities; I will call them deductive probabilities. They must be carefully
distinguished from deduced probabilities (see § 70), which are deductively
derived from other probabilities and thus presuppose the use of inductive
inferences at some place. Deductive probabilities are single-case probabilities
because their values do not express frequencies; the values $p_n$ vary, in
general, from element to element. The inductive conception reserves the name
of probability for the limit $p$ of the $p_n$, or of $m/n$, and thus employs
only inductive probabilities, that is, probabilities the values of which can
only be asymptotically determined through observational material. Such
probabilities are class probabilities; if they are applied to individual
cases, the transfer represents only a mode of speech translatable into
frequency statements. Besides inductive probabilities, the inductive
conception uses asymptotic indices $p_n$ derived deductively from
observational material; however, it does not call the $p_n$ probabilities, but
regards them as substitutes for the probability $p$, to be used as long as the
limiting value $p$ is unknown. The simplest form of a rule determining values
$p_n$ is the rule of induction, which puts $p_n = m/n$.

So far, the difference between the two conceptions is merely terminological.
It does not matter whether the $p_n$ are called probabilities of the individual
element or whether the limiting value $p$ is regarded as the probability of each
element, for which the value $p_n$ is a substitute. A material difference arises,
however, with respect to the choice of the rules determining the $p_n$.

The inductive conception regards the decision for a particular rule of this
kind, if the rule is set up in primitive knowledge, as a convention. Since the
applicability of the rule is controlled only by the asymptotic criterion (2),
the value of an individual $p_n$ is arbitrary and without objective
significance; it does not describe a property of the sequence. The different
sets of values $p_n$ are asymptotically indistinguishable. It is for this
reason that the $p_n$ are not called probabilities in the inductive
conception; the name is used only for the limit $p$, which is an objective
characteristic of the sequence. Adherents of the deductive conception,
however, select among the admissible rules for the $p_n$ one rule as the best,
asserting that they can prove its superiority. For this reason they regard an
individual $p_n$ as objectively significant and call it a probability. Being
derived from observational material, this probability is usually construed as
holding relative to the material, so that an expression denoting the
observational material enters into the reference term of the probability
functor. Inductive probabilities do not include such a reference, because they
are independent of the observational material from which they are inductively
inferred; they express an objective relation holding for a sequence.
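
That different rules for the asymptotic indices are indistinguishable in the
limit can be seen in a small sketch. It compares the rule of induction with
the rule of succession and with a third, purely hypothetical rule of the same
asymptotic form. The simulated sequence, its limit 0.3, and the random seed
are assumptions made only for the illustration.

```python
import random

# Three rules assigning an asymptotic index to m favorable cases among n.
RULES = {
    "induction": lambda m, n: m / n,
    "succession": lambda m, n: (m + 1) / (n + 2),
    "hypothetical": lambda m, n: (m + 3) / (n + 10),  # arbitrary variant
}


def posits(sequence, n):
    """Value posited by each rule after the first n elements."""
    m = sum(sequence[:n])
    return {name: rule(m, n) for name, rule in RULES.items()}


def spread(values):
    """Largest difference between the values posited by the rules."""
    return max(values.values()) - min(values.values())


rng = random.Random(0)
sequence = [1 if rng.random() < 0.3 else 0 for _ in range(100_000)]

early = posits(sequence, 10)
late = posits(sequence, 100_000)
print(early, spread(early))
print(late, spread(late))
```

The rules may disagree noticeably on a short initial section, while their
difference is bounded by a quantity of the order 1/n and so vanishes for
large n, whatever the sequence.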

Since the asymptotic convergence cannot supply a criterion of selection
among the rules, deductive theories must resort to aprioristic reasoning.
Some of these theories, in fact, are openly advanced with the appeal to a
synthetic a priori, such as the theories based on the principle of indifference
or on rational belief. Others, like those of Carnap or
Helmer-Hempel-Oppenheim, are intended to use only analytic principles for the
determination of the $p_n$. It is obvious that such theories must break down,
since analytic reasoning cannot prove any set $p_n$ satisfying (2) to be
superior to the others. If it is argued that such a theory is justified
because its probabilities converge asymptotically with the one supplied by the
inductive theory,⁴ the argument is easily seen to be invalid: asymptotic
convergence justifies all such rules alike, but cannot be adduced if
differences between the rules are under discussion.

The rules for the selection of the $p_n$ are sometimes based on the assumption
of a metric $q(p)$ for the antecedent probability (Carnap), sometimes on the
principle of maximum likelihood (Fisher, Helmer-Hempel-Oppenheim), a
method which, as far as the selection of values $p_n$ is concerned, coincides
with the metric $q(p) = \text{const.}$ (see § 70). If the aim of the method is
to construct an asymptotic rule, any such metric is permissible, not because
the metric $q(p)$ is derivable from a “plausible” principle like the principle
of indifference, but because the metric is irrelevant. Laplace’s convergence
relations (9, § 62) guarantee the asymptotic convergence for every metric
$q(p)$, if it is known, at least, that the Bernoulli relations (10, § 57) are
satisfied. But this validation of deductive methods for the selection of a set
$p_n$ is very different from the intentions of their authors: every deductive
method of this kind is admissible because it leads to a rule of induction of
the general form (1, § 87), and there are no arguments that favor any of them.
The search for a distinction among the class of admissible inductive rules in
primitive knowledge concerns a pseudoproblem.
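
The irrelevance of the metric for the asymptotic result can be sketched
numerically. The example assumes independent observations (a Bernoulli
likelihood), approximates the integrals by sums over a fine grid, and takes
the mean of the resulting inverse distribution as the selected value. The
three metrics are hypothetical choices, and the function names are
illustrative.

```python
from math import exp, log

# Three hypothetical antecedent metrics q(p), positive inside (0, 1).
PRIORS = {
    "constant": lambda p: 1.0,
    "increasing": lambda p: 2.0 * p,
    "decreasing": lambda p: (1.0 - p) ** 4,
}


def posterior_mean(prior, m, n, steps=2000):
    """Mean of the inverse distribution after m favorable cases among n.

    Works with logarithms so that the likelihood does not underflow for
    large n; the grid uses interval midpoints and so avoids p = 0 and p = 1.
    """
    grid = [(i + 0.5) / steps for i in range(steps)]
    logw = [log(prior(p)) + m * log(p) + (n - m) * log(1 - p) for p in grid]
    top = max(logw)
    weights = [exp(v - top) for v in logw]
    return sum(p * w for p, w in zip(grid, weights)) / sum(weights)


for m, n in [(3, 10), (3000, 10000)]:
    print(n, {name: round(posterior_mean(q, m, n), 4) for name, q in PRIORS.items()})
```

With ten observations the three metrics lead to clearly different values;
with ten thousand observations at the same frequency they agree to within a
small interval around 0.3.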

My objection to deductive theories, therefore, is not that their sets $p_n$ are
different from the one introduced by my inductive rule. I have emphasized
repeatedly⁵ that the justification of induction refers to the class of rules
(1, § 87) and is not restricted to the inductive rule in the narrower sense.
My criticism concerns the claim of deductive theories to select one set $p_n$ as
superior to others by methods for which no justification is given. By
justification I understand a proof, not that some people believe in these
methods or behave as though they believed in them, but that it is advantageous
to use them.

4 This argument is used by R. Carnap, “On Inductive Logic,” in Philos. of Science, Vol.
XII (1945), p. 97.

5 For instance, in EP, pp. 355-356. For a correction of my earlier views see footnote, p. 447.

It was mentioned above (§ 71) that it is possible to introduce a concept
of probability for which the principle of indifference is analytic. Similarly,
it is possible to define a concept of probability for which the rules determining
the $p_n$ are analytic. Of this kind is Carnap's concept of a nonfrequency
probability. But this probability concept, and with it every deductive probability,
is empty because it has no predictional value. To say that $p_n = 3/4$ is as
justifiable as saying that $p_n = 1/2$ because both values fit the asymptotic
rule. And both values are compatible to the same extent with whatever happens.
Assume that $p_n$ was chosen $= 1/2$, but that observation of further $s$
elements shows the frequency to converge to $3/4$. Then the value
$p_n = 1/2$ is not regarded as falsified, although a new probability $p_{n+s}$
close to $3/4$ is now introduced; this probability is said to have different
observational material in the reference term. Through this evasion of
verification by future observation, deductive probabilities lose any
predictional value. If the statement $p_n = 1/2$ is true, it does not help to
know it, because it has no implications for the future.

This is why the inductive conception does not call the $p_n$ probabilities. If
the value $p_n$ is regarded as the measure of a probability relation between the
observational material and a future event, the probability relation is
ambiguous; there is no most advisable value for $p_n$ in primitive knowledge.
There is a ground for asserting $p_n$; but the ground-for-assertion relation
determines without discrimination a class of values $p_n$ in terms of the
observational material. It is not the use of an individual $p_n$ but the
progressive use of the total set that can be shown to be advisable; such
progressive use will lead to a prediction of the frequency if there is a limit
of the frequency. To this limit of the frequency the inductive conception
applies the name of probability. The degree of probability is thus made a
number uniquely determined for classes of observable objects, independent of
the state of our knowledge. The distinction between (2 and 3, § 87) expresses
the difference between the two conceptions. What is called a probability
relation in the deductive conception is actually a ground-for-assertion
relation inaccessible to quantitative measure.

I do not wish to say that there are no methods of selecting a particular
set $p_n$ as better than others; on the contrary, the inductive conception has
developed such methods. But it constructs them only in advanced knowledge.
This means that it bases the selection of a set $p_n$ on more material than is
given by the initial section of the sequence up to the place $n$; and for the
selection it uses the results of inductive inferences referring to other sequences.
For instance, if, in advanced knowledge, the special theorem of multiplication
is to be used, it must first be proved that the condition of independence holds;
this proof presupposes the use of the rule of induction for a certain material,
since it amounts to showing that the limits of two frequency sequences are
equal. Or, if the anticipative posits are restricted to a certain form of the
distribution $d(s;u)$, as for Fisher’s method of maximum likelihood, the choice
of the form must be based on previous inductive inferences referring to
further observational material. If methods of this kind lead to regarding the
individual $p_n$ as probabilities, this means that the $p_n$ have been shown to
be the limits of the frequencies in certain other sequences. For instance, the
$p_n$ can be the limits of vertical columns in a lattice built up of horizontal
sequences, such as the probability values given by the rule of succession
(22, § 62). The justification of all methods of improved convergence is
therefore reducible to the justification of statistical induction and thus to
the use of asymptotic methods.

The problem of selecting a set $p_n$ that leads to a better convergence, which
the deductive conceptions cannot solve, is thus made accessible to a solution
free from aprioristic methods. The solution consists in demonstrating that
the use of asymptotic rules for a wider material can lead to a selection among
asymptotic rules for a narrower material. The following sections (§§ 89-90)
will show in what sense this proof can be given.


§ 89. The Method of Correction 

Since the rule of induction is the only primary instrument of finding
probabilities, it must be possible to show how advanced knowledge is built up
from primitive knowledge by means of the rule of induction.

By means of the inductive rule we set up posits concerning the limit of the
frequency in a sequence and thus establish probability values. The
probabilities so constructed can be used as the weights of certain other posits;
we are thus able to construct appraised posits by means of anticipative posits. 
The appraised posits can even be identical with some of the anticipative 
posits; in other words, we can transform an anticipative posit into an appraised 
posit. Since the weight thus constructed can be used for a change in the 
posited value of the limit, we speak here of the method of correction. 

The general method of correction is based on the combined use of a great 
number of posits; the totality of posits is so evaluated that certain individual
posits can be corrected. In this procedure we start with primary posits, posits 
of the first level, and then construct from them secondary posits, posits of 
the second level, which supply weights of the primary posits. By this method 
a weight can be constructed for every posit of the first level. The statement 
about the weight, however, will itself represent an anticipative posit. In the 
same manner we can construct a weight for the secondary posits, and so on. 
Proceeding to higher and higher levels, we can find a weight for every posit 
on any level and thus can transform every anticipative posit into an appraised 
posit, though there will always remain anticipative posits in the system,
namely, those of the last level.

The procedure, which represents the transition from the primitive to the
advanced state of knowledge, follows the schema that we called cross
induction (§ 84); its mathematical structure corresponds to the methods
developed in § 62. A finite number of finite sections of the sequences, that
is, a finite initial part of the lattice (3, § 58), is given:

[...]

has been determined for every horizontal section. As in § 62, the range from
0 to 1 is divided into $r$ intervals $\eta_1 \ldots \eta_r$ of the width
$2\delta$; the middle value of an interval $\eta_\rho$ may be called $p_\rho$
or $f_\rho$. If $f^{kn}$ lies within $\eta_\rho$, we write¹

$$f^{kn} = p_\rho \pm \delta \tag{3}$$

The system of primary posits is constructed by putting

$$\lim_{n \to \infty} f^{kn} = p_\rho \pm \delta \quad \text{if} \quad f^{kn} = p_\rho \pm \delta \tag{4}$$


In order to obtain the system of secondary posits, we count for each
$\eta_\rho$ the horizontal sections of the sequence for which
$f^{kn} = p_\rho \pm \delta$. Thus we determine frequencies $g_\rho^{sn}$,
which we define by

$$g_\rho^{sn} = \frac{1}{s} \mathop{N}_{k=1}^{s} (f^{kn} = p_\rho \pm \delta) \tag{5}$$

From here we go in two steps to the secondary posits. We first posit that the
$g_\rho^{sn}$ represent for the limit $n \to \infty$ the frequencies of the
$s$ sequences for the intervals $\eta_\rho$. Using the notation

$$g_\rho^{s} = \frac{1}{s} \mathop{N}_{k=1}^{s} (\lim_{n \to \infty} f^{kn} = p_\rho \pm \delta) \tag{6}$$

we thus posit

$$g_\rho^{s} = g_\rho^{sn} \pm \epsilon \tag{7}$$

where $\epsilon$ is a given interval of exactness of this posit. On the second
step we posit that the values $g_\rho^{sn}$ also represent the limits for
$s \to \infty$, that is, we posit

$$q_\rho = \lim_{s \to \infty} g_\rho^{s} = g_\rho^{sn} \pm \eta \tag{8}$$


1 In this notation the expression $p \pm \delta$ is regarded as an indefinite description (see ESL,
p. 264), so that (3) means: there is a value within the interval from $p - \delta$ to $p + \delta$ such that
$f^{kn}$ is equal to this value.

This is the posit by which we obtain probabilities of the second level $q_\rho$
from the finite section (1) of the lattice. $q_\rho$ is the probability that
the probability of the first level is $= p_\rho \pm \delta$.
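
The counting that leads from the horizontal sections to the posited
probabilities of the second level can be sketched as follows. The lattice
section, the number of intervals, and all function names are hypothetical;
the intervals are taken to be of equal width 1/r, closed on the left, which is
one possible convention for frequencies that fall on a boundary.

```python
def interval_index(m, n, r):
    """Index (0-based) of the interval of width 1/r that contains m/n."""
    return min(m * r // n, r - 1)


def primary_posits(lattice, r):
    """Middle value of the interval containing each horizontal frequency."""
    return [
        (interval_index(sum(row), len(row), r) + 0.5) / r for row in lattice
    ]


def vertical_frequencies(lattice, r):
    """Share of the s horizontal sections whose frequency lies in each interval."""
    s = len(lattice)
    counts = [0] * r
    for row in lattice:
        counts[interval_index(sum(row), len(row), r)] += 1
    return [c / s for c in counts]


# Hypothetical finite section of a lattice: s = 4 sequences, n = 10 elements.
lattice = [
    [1, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 0, 1, 1, 0, 1],
    [0, 1, 0, 0, 0, 1, 0, 0, 0, 1],
]

print(primary_posits(lattice, r=5))
print(vertical_frequencies(lattice, r=5))
```

The second list is then itself posited, within an interval of exactness, as
the set of antecedent probabilities of the first-level values.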

The probability of the second level $q_\rho$ does not immediately represent the
rating of the posit $p_\rho$ of the first level; $q_\rho$ is the antecedent
probability of $p_\rho$, but not its rating based on the observed initial
section. This rating is supplied by the inverse probability $v_n$, which was
computed in § 62 as a certain function of the $q$ and $p$. The treatment used
here, which is meant to portray the approximative methods of primitive
knowledge, is distinguished from the one used in § 62 in that the intervals
$\eta_\rho$ are finite and a transition to the limit 0 is not made.
Correspondingly, summations take the place of integrations. The precise
frequency $f$ is replaced by the frequency interval $f \pm \delta$, and the
inverse probability density $v_n(f;p)$ of (5, § 62) is replaced by a density
$v_n(f \pm \delta; p)$. Corresponding to a remark made in § 45, a relative
probability function having a small interval in the reference class is
approximately equal to the function written for a precise value in the same
place; the value (5, § 62), therefore, is approximately valid.

For the construction of the exact value, the following notation will be used.
The inverse probability that, if $f$ is within an interval $\eta_\rho$, $p$ is
within the interval $\eta_\mu$, will be written $v_{n\rho\mu}$. The forward
probability that, if $p$ is within an interval $\eta_\sigma$, the frequency $f$
lies within the interval $\eta_\rho$, will be written $b_{n\sigma\rho}$;
for normal sequences this is the Bernoulli function (10, § 49). Thus, instead
of (5, § 62) and (9, § 62), the following relations result:

$$v_{n\rho\mu} = \frac{q_\mu b_{n\mu\rho}}{\sum_{\sigma=1}^{r} q_\sigma b_{n\sigma\rho}} \to \begin{cases} 1 & \text{for } \rho = \mu \\ 0 & \text{for } \rho \neq \mu \end{cases} \tag{9}$$

The proof of the convergence to 1 or 0, respectively, follows because,
according to (25, § 49), $b_{n\rho\rho}$ goes to 1 with increasing $n$, whereas
$b_{n\sigma\rho}$ goes to 0 for $\sigma \neq \rho$; for large $n$ the
denominator reduces virtually to the term $q_\rho b_{n\rho\rho}$.

The presentation up to (8) has shown how the antecedent probabilities $q_\rho$
can be determined from statistical material in the form of posits. In order
to determine $v_{n\rho\mu}$ we must know, furthermore, the functions
$b_{n\sigma\rho}$. But the determination of these functions presupposes a
knowledge of the normal character of the sequence, or some other knowledge
that enables us to compute the probability of combinations, like the knowledge
that the sequences have probability transfer. I shall therefore indicate, as
an illustration for the determination of sequence structure in primitive
knowledge, by what methods the normal character of the sequences can be
ascertained.
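
Once the normal character of the sequences is assumed, the inverse
probability of relation (9) can be computed. The following sketch takes the
forward probability to be a sum of binomial terms, evaluated at the middle
value of each interval as an approximation, and reads the inverse probability
as the chance that the limit lies in interval mu, given that the observed
frequency lies in interval rho. The antecedent probabilities q and all names
are hypothetical.

```python
from math import comb


def forward(n, p, rho, r):
    """Chance that m/n falls into interval rho when the limit is p."""
    return sum(
        comb(n, m) * p**m * (1 - p) ** (n - m)
        for m in range(n + 1)
        if min(m * r // n, r - 1) == rho
    )


def inverse(n, q, rho, mu):
    """Chance that the limit lies in interval mu, given m/n in interval rho."""
    r = len(q)
    mid = [(k + 0.5) / r for k in range(r)]
    weights = [q[sigma] * forward(n, mid[sigma], rho, r) for sigma in range(r)]
    return weights[mu] / sum(weights)


# Hypothetical antecedent probabilities strongly favoring interval 1.
q = [0.02, 0.90, 0.04, 0.02, 0.02]

# Observed frequency lies in interval 3, first for n = 10, then for n = 200.
for n in (10, 200):
    print(n, round(inverse(n, q, 3, 1), 4), round(inverse(n, q, 3, 3), 4))
```

For the short section the antecedent probabilities dominate and the favored
interval receives the higher weight; for the long section the weight of the
interval containing the observed frequency approaches 1, in accordance with
the convergence relation.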

The method consists in the determination of probabilities in subsequences
and thus is not different in principle from the methods so far presented. We
first examine whether a given sequence is free from aftereffect. We construct
a subsequence from the given section of the sequence by the use of a selection
by the first predecessor; in this subsequence the probability is determined
by a posit according to (4). We repeat the same procedure for all subsequences
that result from a selection by a group of two predecessors, and so on, up to
$t$ predecessors. Now we conceive the probabilities thus found in $t$
subsequences as a new probability sequence, and judge about its continuation
for $t \to \infty$, that is, for a transition to longer and longer groups of
predecessors, by means of a posit. Assume, for instance, that the
probabilities of the subsequences considered are, in the majority, equal to
the probability of the main sequence within the interval of exactness
$\pm \delta$; then we regard the coincidence of the two probabilities as a
property C and posit that the frequency of C will go toward 1 for the limit
$t \to \infty$. The rule of induction is thus applied to obtain a probability
of the second level and leads to the posit that for all subsequences there
exists a coincidence between their probabilities and the main probability.
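
The examination for freedom from aftereffect can be sketched in a few lines.
The example selects the elements that follow a given group of predecessors,
compares the frequency in each such subsequence with the main frequency, and
reports the share of coincidences. The two test sequences, the interval of
exactness 0.05, and the function names are assumptions made for illustration.

```python
import random


def frequency(bits):
    """Relative frequency of the attribute (value 1) in a finite section."""
    return sum(bits) / len(bits)


def select_by_predecessors(bits, pattern):
    """Elements that immediately follow an occurrence of `pattern`."""
    k = len(pattern)
    return [
        bits[i]
        for i in range(k, len(bits))
        if tuple(bits[i - k : i]) == tuple(pattern)
    ]


def coincidences(bits, patterns, delta):
    """For each pattern, whether the subsequence frequency matches the main one."""
    main = frequency(bits)
    results = {}
    for pattern in patterns:
        sub = select_by_predecessors(bits, pattern)
        if sub:  # patterns that never occur give no subsequence
            results[pattern] = abs(frequency(sub) - main) <= delta
    return results


patterns = [(0,), (1,), (0, 0), (0, 1), (1, 0), (1, 1)]

# A strictly alternating sequence has a strong aftereffect.
alternating = [i % 2 for i in range(1000)]

# A pseudo-random sequence stands in for a sequence free from aftereffect.
rng = random.Random(1)
fair = [1 if rng.random() < 0.5 else 0 for _ in range(20_000)]

for name, bits in [("alternating", alternating), ("fair", fair)]:
    found = coincidences(bits, patterns, delta=0.05)
    print(name, sum(found.values()) / len(found))
```

The printed share is the observed frequency of the property C; the posit that
it goes toward 1 for ever longer groups of predecessors remains an
anticipative posit.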

By means of a corresponding procedure we can determine also the
probabilities of the subsequences resulting from regular divisions (§ 30).
Corresponding considerations hold for properties of the lattice. The methods that
determine the type of the sequence from the dispersion can also be included 
in the procedure explained above. Thus we see that the ascertainment of any 
special type of sequence can, in principle, be carried through by applying the 
rule of induction, whether we deal with normal sequences in the narrower 
or in the wider sense, or sequences with probability transfer, or Bernoulli 
sequences, and so on. No a priori assumptions are required. 

That the inductive rule is all we need follows from the fact that we could 
construct the entire calculus of probability from axioms that are logically 
derivable from the frequency interpretation. We could characterize every 
special case by the property that some probabilities are numerically equal 
(as in normal sequences), or that they approach a limit (as in Bernoulli 
sequences). The occurrence of such a property can always be established by 
means of the rule of induction. We thus confirm the previous result that all 
applications of the calculus of probability to physical reality can be carried 
through by using the rule of induction as the only nondeductive principle. 

Once the $b_{n\mu\rho}$ have been determined within the ascertainment of the
type of the sequence, the probability $v_{n\rho\mu}$ can be found from (9) in
the form of a posit. The values $q_\mu$ and $b_{n\mu\rho}$ to be inserted in
(9) are posited, not ultimate, values; thus the value $v_{n\rho\mu}$ can be
posited only within a certain interval of exactness $\theta$:

$$v_{n\rho\mu} = \frac{q_\mu b_{n\mu\rho}}{\sum_{\sigma=1}^{r} q_\sigma b_{n\sigma\rho}} \pm \theta \tag{10}$$

When the weights $v_{n\rho\mu}$ are determined, they can be used for the
correction of the primary posits. Should it turn out, for instance, that for
some of the sequences the maximum probability is not given by $v_{n\rho\rho}$,
but by $v_{n\rho\mu}$, where $\rho \neq \mu$, we shall correct the original
posit correspondingly. We can thus transform the original anticipative posit
into an appraised posit. Such a case will occur, for instance, when all
sections (1) have a frequency within $f_\mu \pm \delta$, except one; for this
sequence we shall then posit the limit also within $f_\mu \pm \delta$, though
this does not correspond to its frequency $f^{kn}$, which is within
$f_\rho \pm \delta$. We have thus constructed the method of cross induction,
introduced in § 84.

We now see the importance of the convergence relation added in (9),
according to which $v_{n\rho\rho}$ goes toward 1, no matter what the values of
the $q_\rho$ are. If it turns out, in a prolongation of the sections, that the
frequency remains within $f_\rho \pm \delta$, the method of correction must
finally lead to the result that $v_{n\rho\rho} > v_{n\rho\mu}$. We meet here
with a peculiar interrelation of the totality and the individual case. At
first the totality is more “powerful” than the individual case given by one of
the sequences, and may correct it. With sufficient prolongation of the one
sequence, however, the individual case can finally become more “powerful” than
the totality. Even if only one sequence of a deviating frequency occurs, we
assume for a very great $n$ that we are actually concerned with a deviating
individual case. We refer to the illustration given by the deviation of the
planet Mercury from Newton’s law (see § 85). This is the version given by the
theory of probability to the specific tension that exists between individual
case and totality in all empirical inquiry. An exception from a hitherto known
law will at first be interpreted as an error. If it recurs frequently,
however, it will be acknowledged as a fact, and the general law will be
altered correspondingly. In the theory of probability this independence of
individual facts finds its expression in the convergence relation stated
in (9).

§ 90. The Hierarchy of Posits 

I shall summarize the results so far obtained. It has been shown that, if we 
know the limit of the frequency in a sequence, this value can be regarded as 
the weight of an individual posit concerning an unknown element of the 
sequence. The weight may be identified with the probability of the single 
case, assuming the character of a truth value. In order to find the limit of the 
frequency we use an anticipative posit; its weight is unknown. In order to 
determine this weight we must make an anticipative posit on a higher linguistic
level; the former anticipative posit is then transformed into an appraised
posit, that is, a posit of known weight. 

The procedure can be extended to higher and higher linguistic levels. It is
obvious, however, that we shall never reach a true statement in this way;
each of the metalanguages in this infinite hierarchy follows a probability
logic. The infinite system may be compared with the system constructed in
the fourth column of table 8 (p. 392). There is, however, an important
difference. The latter infinite system is given by a mathematical rule,
whereas the probability system must be constructed step by step, by the use of
empirical methods. We must stop on a finite level without any knowledge about
higher levels; in particular, we do not know whether the probability system is
convergent. When we thus cut off the probability system on a finite level, we
do not know how great the mistake so introduced will be. The posit on the last
level will be an anticipative posit, the weight of which is unknown.

The fact that we must cut off the hierarchy of posits at a posit of unknown 
weight endows the structure of probability logic with a peculiar uncertainty. 
The question offers itself: What is gained by the transition to posits of a 
higher level? Why do we prefer to use appraised posits instead of anticipative 
posits on the lower levels, if all the weights are obtained ultimately by anticipative
posits? What is the logical significance of the structure of the levels
if on the last level we remain with anticipative posits? 

In order to answer these questions, assume that the system is cut off at 
the secondary posits, and apply to the secondary posits the principle that was 
used for the justification of the anticipative posit on the first level. This 
justification was based on the self-corrective nature of induction; we showed 
that the anticipative posit leads to the correct value of the limit within the 
given interval of exactness $\pm\delta$ if the sequence has a limit. The same principle
is applicable to secondary posits. But we must now compare the convergence
of primary and secondary posits.

Consider the determination of the secondary probabilities $q_p$ that was given
in § 89. The primary posit used there is formulated in (4, § 89); the two
steps necessary for obtaining the secondary posits are given in (7 and 8, § 89).
The following theorems are derivable from the definition of the limit, on the
assumption that the sequences have a limit of the frequency:

1. If s is kept constant, there is an $n = n_1$ such that the primary posits
(4, § 89), concerning the horizontal sequences, become correct for all s
sequences and remain correct for all greater n from there on.

2. If s is kept constant, there is an $n = n_2$ such that the secondary posits
(7, § 89), concerning the vertical frequencies up to s sequences, become
correct and remain correct for all greater n from there on.

If we understand by $n_1$ and $n_2$ the smallest values having these properties,
we can prove the inequality

$$n_2 \le n_1 \qquad (1)$$

This relation follows from the fact that the number $q^s_{pn}$ can correspond to $q^s_p$
within the interval of exactness $\pm\varepsilon$ even when the correct value of the
horizontal limit has not yet been found for all s sequences. In some horizontal
sequences the primary posits may still be false, although the frequencies
counted vertically, for the determination of $q^s_p$, may have arrived within the
interval of exactness $\pm\varepsilon$. If the primary posits, however, are correct in all s
sequences, the secondary posits (7, § 89) must also be correct. [Only in exceptional
cases will the equality sign in (1) hold; but we cannot make use of this
qualification for the following considerations, since it is based on an inductive
inference from our experience. It cannot be derived from the definition of
the limit.]
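
The content of inequality (1) can be illustrated numerically. The following Python sketch is not the construction of § 89; it rests on assumptions made only for illustration: s simulated sequences with known limits, and, as a stand-in for the secondary posit, the mean of the s observed frequencies compared with the mean of the limits. The names `simulate` and `first_stable_index`, the seed, and all parameter values are hypothetical. Since the aggregate value is correct whenever every primary posit is correct, its settling index cannot exceed that of the primary posits within the observed horizon.

```python
import random


def first_stable_index(flags):
    """Smallest 1-based n such that flags[n-1:] are all True.

    Returns len(flags) + 1 if the last flag is False.
    """
    n = len(flags) + 1
    for i in range(len(flags) - 1, -1, -1):
        if not flags[i]:
            break
        n = i + 1
    return n


def simulate(limits, horizon, delta, seed=0):
    """Return (n1, n2) for simulated sequences with the given limits.

    n1: settling index of "all primary posits correct within delta".
    n2: settling index of the aggregate stand-in (mean frequency within
        delta of the mean limit); NOT the secondary posit of § 89.
    """
    rng = random.Random(seed)
    s = len(limits)
    counts = [0] * s
    mean_limit = sum(limits) / s
    primary_ok, aggregate_ok = [], []
    for n in range(1, horizon + 1):
        for i, p in enumerate(limits):
            counts[i] += int(rng.random() < p)  # one more element per sequence
        freqs = [c / n for c in counts]
        primary_ok.append(
            all(abs(f - p) <= delta for f, p in zip(freqs, limits))
        )
        # Small tolerance guards against floating-point rounding at the boundary.
        aggregate_ok.append(abs(sum(freqs) / s - mean_limit) <= delta + 1e-12)
    return first_stable_index(primary_ok), first_stable_index(aggregate_ok)


if __name__ == "__main__":
    n1, n2 = simulate([0.2, 0.5, 0.8, 0.35], horizon=2000, delta=0.05)
    print("n1 =", n1, "n2 =", n2)
```

The sketch checks correctness only up to the finite horizon, so "remains correct for all greater n" is approximated by "remains correct for all observed n".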

Adding the transition to $s \to \infty$, we can derive the following theorems on
the assumption that all the sequences have a limit:

3. There is an $s = s_3$ and an $n = n_3$ such that all the posits (8, § 89) are
correct. 

To prove this theorem, assume that there exists a limit of the frequency 
for vertical counting with respect to horizontal sequences as elements. Then 
there must be an $s = s_3$ such that the frequency $q^s_p$, for the definition of which
we have assumed that the limits in the horizontal sequences are known,
corresponds within the interval of exactness $\pm\zeta$ to the value $q_p$ resulting
for $s \to \infty$. Now we do not know the precise value of $q^s_p$ because the horizontal
sequences are known only up to n elements. But since, according to theorem 2,
the value $q^s_{pn}$ found statistically will correspond to $q^s_p$ within the interval of
exactness $\pm\varepsilon$, the value $q^s_{pn}$ will be close to $q_p$ within the interval $\pm(\varepsilon + \zeta)$.
It can thus be brought as close to $q_p$ as we wish by a suitable choice of $s_3$
and $n_3$. The posits (8, § 89) will be correct when $\varepsilon + \zeta \le \eta$.
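
The estimate used in this proof is an application of the triangle inequality. Written out as a sketch, with $\varepsilon$ the exactness of the statistically found value relative to the frequency among s sequences (theorem 2), $\zeta$ the exactness of that frequency relative to its limit for $s \to \infty$, and $\eta$ the exactness required of the posits (8, § 89):

```latex
\[
  \bigl| q^{s}_{pn} - q_{p} \bigr|
  \;\le\; \bigl| q^{s}_{pn} - q^{s}_{p} \bigr| + \bigl| q^{s}_{p} - q_{p} \bigr|
  \;\le\; \varepsilon + \zeta
  \;\le\; \eta
  \qquad (s = s_3,\; n = n_3).
\]
```

The first term is controlled by the choice of n for fixed s, the second by the choice of s alone.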

4. Generally, there is no $n = n_4$ such that for all s the posits (4, § 89)
are correct. Yet if there is such an $n = n_4$, that is, if we deal with the special
case of a uniform convergence (§ 65), the inequality

$$n_3 \le n_4 \qquad (2)$$

holds. This follows by considerations as given for theorem 2.

The four theorems express an intrinsic difference between primary and 
secondary posits, which can be formulated in the following statements: 

a. The secondary posits are independent of the individual primary posits.
An individual primary posit may be false, though the secondary posit is true
[inequalities (1) and (2)].

b. If all the sequences have a limit of the frequency, the correctness of all
the secondary posits can be reached in a finite number of steps, though the 
primary posits to which they refer are not made in an infinite number of 
sequences. This holds whether or not there is a uniform convergence. 

These results, which are derived from the assumption of the limit of the 
frequency alone, formulate the superiority of secondary posits over primary 
ones. Briefly, they say that secondary posits are independent of primary 
posits; that, increasing the degree of approximation by going to greater values 
of n, we may count upon the possibility that secondary posits will be correct
at an earlier stage than primary ones. The transition to secondary posits 
makes possible a better convergence. This mathematical result justifies us 
in regarding secondary posits as better than primary posits. 

We are now in a position to recognize the significance of the structure of 
levels in probability logic. The order of levels does not lead to a system that 
as a whole is equivalent to a two-valued true statement, as in the multivalued 
logic developed in § 76. Neither can we claim that the infinite system of 
probability logic will strictly converge, as does the infinite system of column 4, 
table 8 (p. 392). But we can make a statement of convergence in a different 
form: when we improve the approximation by extending the primary sequences,
the posits of higher levels will, in general, become correct at an
earlier stage—in any case, never later than the posits of lower levels. In this 
sense the multivalued system of probability logic, which has an infinite 
number of levels of language, is a convergent system. The higher the level 
on which it was cut off, 1 the sooner we may expect the convergence of this 
system with increasing n of the primary sequences. The only assumption on 
which the theorem is based is that the sequences converge to a limit of the 
frequency. Such a system will be called a quasi-convergent system. 

These results will make clear the use of probabilities for the verification 
of limit statements, a procedure that is applied within an advanced state of 
knowledge (see § 74). Only within a system of two-valued logic is it impossible 
to verify statements about infinite sequences or practically infinite sequences 
before the sequences are completely observed. Within the framework of probability
logic, however, such a verification by degree is possible. The statement
of verification itself is imbedded in a metalanguage that is governed by 
probability logic. It is true that we must cut off this system on a finite level 
and stop with a posit of unknown weight. But we know that if the system 
is quasi-convergent, verification by levels has a distinct advantage: it opens 
the possibility of a better convergence than would be attainable by the use 
of primary posits alone. The statement that the posits so appraised are the 
best posits is itself a posit, and an anticipative one. When we use the weights 
so constructed for the correction of primary posits, we may make mistakes. 

1 G. H. von Wright, in his thesis, “The Logical Problem of Induction,” in Acta Philos.
Fennica, Part 3 (1941), p. 185, quotes this passage from the German version of this book
(p. 407) and adds the following criticism: “Reichenbach does not seem to realize that this
‘may expect’ is not in the slightest way justified by his system”. Apparently von Wright did
not see that the passage quoted from me is merely a paraphrase of my inequalities (1) and
(2), which he does not even mention, and is therefore based on a mathematical proof. It
should be clear that by the words “may expect” I do not mean that some probability of an
earlier convergence exists, but that I speak here only of the possibility of an earlier convergence,
expressed in the sign $\le$ occurring in (1) and (2). In some of the preceding passages
this is stated explicitly. See also the remarks on p. 477 which I have added in this edition. 
Von Wright's misunderstanding of my theory of induction is shown also in his remark that 
my theory was anticipated in C. S. Peirce’s writings. I refer to my remarks about Peirce 
in footnote 1, p. 446. 
But on the assumption of quasi-convergence we can prove: continuing the
sequences observed to greater n and increasing their number to greater s, we
shall reach an $n = n_3$ and an $s = s_3$ such that our corrections will be right in
most cases. The values $n_3$ and $s_3$ can occur at a stage where a great number
of primary posits are still incorrect. The system of posits appraised by posits 
finds its justification, therefore, as an instrument for improving the degree 
of convergence—if such improvement is attainable at all. 

When we connect our results with the considerations of § 85, we see that
the procedure developed as a method of correction constitutes a simplified 
schema for the logical operations performed in scientific inquiry. The system 
of science, therefore, must be regarded as a system of posits, incorporated 
in the framework of probability logic and including posits of lower and higher 
levels. We cut this system off on a certain level of language; the higher the 
level, the better will be the convergence attained if the system of posits is 
quasi-convergent. 

Although the form of the system is given by probability logic, we know 
that this logic does not require any specific assumptions when probabilities 
are reduced to frequencies, since probability logic is then reducible to two-valued
logic (see method 3, § 81). The result follows because the axioms of
probability are derivable from the frequency interpretation. The statement 
that the system is quasi-convergent requires, as its only basis, the existence 
of limits of the frequency. The assumption that such limits exist is sufficient, 
not only to justify individual anticipative posits, but also to justify the 
method of correction, that is, the transition to posits of higher levels. 

We have thus reduced scientific method to two presuppositions: the 
validity of the rules of two-valued logic and the assumption that the sequences 
under consideration have a limit of the frequency. The first presupposition 
need not be discussed. In a theory of empirical science the validity of two-valued
logic may be taken for granted. It is the second presupposition, concerning
the existence of a limit, that we must now discuss. The use of this
assumption represents the last obstacle to be removed before a satisfactory 
theory of probability, and thus of scientific method, can be given. 

§ 91. The Justification of Induction 

In traditional philosophy the problem of induction was restricted to the 
discussion of classical induction (§ 67), of the inductive inference referring to 
a sequence consisting throughout of events of the same kind, for which, 
therefore, we have a relative frequency $f^n = 1$. This book always uses the
wider form of statistical induction, for which the relative frequency $f^n$ may
have any value between 0 and 1. The critical objections that have been raised
against classical induction, however, apply likewise to statistical induction,
and we shall therefore consider these objections. 

The first to criticize the inference of induction by enumeration and to
question its legitimacy was David Hume. 1 Ever since his famous criticism, 
philosophers have regarded the problem of induction as an unsolved riddle 
precluding the completion of an empiricist theory of knowledge. In Hume’s 
analysis it does not appear as a problem of probability; he includes it, rather, 
in the problem of causality. We observe, Hume explains, that equal causes 
are always followed by equal effects. We then infer that the same effects 
will occur in future. On what grounds do we base this inference? Hume’s 
criticism gave two negative answers to the question: 

1. The conclusion of the inductive inference cannot be inferred a priori,
that is, it does not follow with logical necessity from the premises; or, in modern
terminology, it is not tautologically implied by the premises. Hume based
this result on the fact that we can at least imagine that the same causes will 
have another effect tomorrow than they had yesterday, though we do not 
believe it. What is logically impossible cannot be imagined—this psychological 
criterion was employed by Hume for the establishment of his first thesis. 

2. The conclusion of the inductive inference cannot be inferred a posteriori,
that is, by an argument from experience. Though it is true that the inductive 
inference has been successful in past experience, we cannot infer that it will 
be successful in future experience. The very inference would be an inductive 
inference, and the argument thus would be circular. Its validity presupposes 
the principle that it claims to prove. 

Hume did not see a way out of this dilemma. He regarded the inductive 
inference as an unjustifiable procedure to which we are conditioned by habit 
and the apparent cogency of which must be explained as an outcome of habit. 
The power of habit is so strong that even the clearest insight into the unfounded
use of the inductive inference cannot destroy its compelling character.
Though this explanation is psychologically true, we cannot admit that it 
has any bearing on the logical problem. Perhaps the inductive inference is a 
habit—the logician wants to know whether it is a good habit. The question 
would call for an answer even if it could be shown that we can never overcome 
the habit. The logical problem of justification must be carefully distinguished 
from the question of psychological laws. 

Up to our day the problem has subsisted in the skeptical version derived 
from Hume, in spite of many attempts at its solution. Kant’s attempt to 
solve the problem by regarding the principle of causality as a synthetic judgment
a priori failed because the concept of the synthetic judgment a priori
was shown to be untenable. I may add that Kant never attempted to make 
use of his theory for a detailed analysis of the inductive inference. In the 

1 An Enquiry Concerning Human Understanding (1748). 
empiricism of our time the problem has come to the fore, overshadowing all
other problems of the theory of knowledge. It has held this place persistently 
without changing the skeptical form that Hume gave it. 

A few philosophers tried to escape Hume's skepticism by denying that a 
problem of the justification of induction exists. Various reasons were given 
for such a conception. It was said that the rule of induction does not belong 
to the content of science; that Hume's criticism concerns only a linguistic 
problem; that the problem of justification is a pseudoproblem; and so on. 
It is hardly comprehensible that such arguments could ever have been seriously
maintained. They misuse an important modern discovery, the distinction
between levels of language, for the purpose of contesting the legitimacy
of an old problem, upon which, however, this distinction has no bearing.

It is true that the rule of induction belongs, not in the object language
of science, but in the metalanguage. It is a directive for the construction of 
sentences, since it tells how to proceed from verified sentences to predictive 
sentences. I have therefore called it a rule of derivation (§ 87), the only one 
that inductive logic requires in addition to the rules of derivation of deductive 
logic. Such rules, however, are admissible within a scientific language only 
when they can be justified, that is, when they can be shown to be adequate 
means for the purpose of derivation. Such a justification is easily given for
the rules of derivation of deductive logic: it can be shown that the rules
always lead to true sentences if the premises are true. In a systematic exposition
of deductive logic this justification of the rules of derivation must be
formally given. 2 For the rule of induction such a proof is not possible; that is 
why the problem of its justification is so involved that it requires a comprehensive
analysis.

The frequency theory of probability, with its interpretation of probability 
statements as posits, makes it possible to give a justification of the rule of 
induction. The problem will be discussed with respect to the wider form of 
statistical induction; the results will then include the special case of classical 
induction. The generalization expressed in the use of statistical induction is 
relevant because it weakens the inference. Whereas classical induction wishes 
to establish a rigorous inference that holds for every individual case, statistical 
induction renounces every assertion about the individual case and makes a 
prediction only about the whole sequence. 

There is another sense in which the statistical version involves a different 
interpretation of the problem. The classical conception entails the question 
whether the rule of induction leads to true conclusions, but the statistical 
version deals only with the question whether the rule of induction leads to a 
method of approximation, whether it leads to posits that, when repeated, 
approach the correct result step by step. The answer is that this is so if the 


2 See § 5 above, and ESL, §§ 12, 14.
sequences under consideration have a limit of the frequency. The inductive
posit anticipates the final result (§ 87) and must eventually arrive at the 
correct value of the limit within an interval of exactness. 

The method of anticipation may be illustrated by an example from another 
field. An airplane flies in the fog to a distant destination. From two ground 
stations the pilot receives radio messages about his position, ascertained by 
radio bearings. He then determines the flight direction by means of a map, 
adjusts the compass to the established course, and flies on, keeping continuously
to the direction given by the compass. In the fog he has no other
orientation than to follow the adopted course. After a while, however, he 
inquires again of the ground stations for another determination of his position. 
It turns out that the airplane was subject to a wind drift that has carried 
the ship off its course. The pilot, therefore, establishes a new course that he 
follows thereafter. 

This method, repeatedly applied, is a method of approximation. The direction
from the position ascertained to the destination is not the most favorable
one because of wind currents; but the pilot does not know the changing currents
and therefore at first posits this direction. He does not believe that he
has found the final direction. He knows that only when he is very close to 
his destination will the straight line be the most favorable flying direction— 
but he acts as though the coincidence of the most favorable flying direction 
and a straight-line connection were reached. He thus anticipates the final 
result. He may do so because he uses this anticipation only in the sense of a 
posit. By correcting the posit repeatedly, always following the same rule, he 
must finally come to the correct posit and thus reach his destination. 
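
The pilot's procedure can be sketched as a small simulation. The sketch below is illustrative only and rests on assumptions not stated in the example: a constant wind weaker than the airspeed, a flat map, and a pilot who never flies a leg longer than the time needed to cover the remaining straight-line distance. The function name `fly` and all numerical values are hypothetical.

```python
import math


def fly(start, destination, wind, airspeed=1.0, fix_interval=5.0, legs=60):
    """Return the distance to the destination after each leg of the flight.

    On every leg the pilot posits that the straight line from the last
    radio fix to the destination is the right course, although the wind
    (unknown to him) makes this course slightly wrong.
    """
    x, y = start
    dest_x, dest_y = destination
    distances = []
    for _ in range(legs):
        dist = math.hypot(dest_x - x, dest_y - y)
        if dist == 0.0:
            distances.append(0.0)
            continue
        # Posit: head straight for the destination from the current fix.
        ux, uy = (dest_x - x) / dist, (dest_y - y) / dist
        # Fly until the next fix, but no longer than the remaining distance needs.
        t = min(fix_interval, dist / airspeed)
        # Actual motion = chosen heading plus wind drift.
        x += (airspeed * ux + wind[0]) * t
        y += (airspeed * uy + wind[1]) * t
        distances.append(math.hypot(dest_x - x, dest_y - y))
    return distances


if __name__ == "__main__":
    d = fly(start=(0.0, 0.0), destination=(100.0, 0.0), wind=(0.0, 0.3))
    print("after first leg:", d[0], "after last leg:", d[-1])
```

Under these assumptions every individual course is wrong, yet each correction reduces the remaining distance, which is the sense in which the repeated posit is a method of approximation.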

The analogy with the anticipative method of the rule of induction is 
obvious. In the analysis of Hume’s problem we thus arrive at a preliminary 
result: if a limit of the frequency exists, positing the persistence of the frequency
is justified because this method, applied repeatedly, must finally
lead to true statements. We do not maintain the truth of every individual 
inductive conclusion, but we do not need an assumption of this kind because 
the application of the rule presupposes only its qualification as a method of 
approximation. 
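
The preliminary result can be made concrete with a minimal sketch of the rule of induction as a repeated posit. It assumes a hypothetical 0/1 sequence whose limit of the frequency is known to be 1/3 (every third element is a 1); the helper names `inductive_posits` and `settling_index` and the value of the interval of exactness are illustrative assumptions.

```python
def inductive_posits(sequence):
    """Return the posits f^n (relative frequency after n elements), n = 1, 2, ..."""
    posits, hits = [], 0
    for n, outcome in enumerate(sequence, start=1):
        hits += outcome
        posits.append(hits / n)
    return posits


def settling_index(posits, limit, delta):
    """Smallest n from which all observed posits stay within limit ± delta.

    Returns len(posits) + 1 if even the last posit is still outside.
    """
    n_settled = len(posits) + 1
    for i in range(len(posits) - 1, -1, -1):
        if abs(posits[i] - limit) > delta:
            break
        n_settled = i + 1
    return n_settled


# Hypothetical sequence: every third element is a 1, so the limit is 1/3.
sequence = [1 if k % 3 == 0 else 0 for k in range(1, 3001)]
posits = inductive_posits(sequence)
n0 = settling_index(posits, limit=1 / 3, delta=0.01)
print("first posits:", posits[:3], "settling index:", n0)
```

The early posits are false, and nothing in the rule marks the moment at which they become correct; the sketch can report the settling index only because the limit was stipulated in advance.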

This consideration bases the justification of induction on the assumption 
of the existence of a limit of the frequency. It is obvious, however, that for 
such an assumption no proof can be constructed. When we wish to overcome 
Hume’s skepticism we must eliminate this last assumption from our justification
of induction.

The traditional discussion of induction was dominated by the opinion that 
it is impossible to justify induction without an assumption of this kind, that 
is, without an assumption stating a general property of the physical world. 
The supposedly indispensable assumption was formulated as a postulate of 
the uniformity of nature, expressed, for instance, in the form that similar
event patterns repeat themselves. The principle used above, that sequences
of events converge toward a limit of the frequency, may be regarded as another
and perhaps more precise version of the uniformity postulate. So long
as logicians maintained that without a postulate of this kind the inductive 
inference could not be accounted for, and so long as there was no hope of 
proving such a postulate true or probable, the theory of induction was condemned
to remain an unsolvable puzzle.

The way out of the difficulty is indicated by the following considerations. 
The insistence on a uniformity postulate derives from an unfortunate attempt 
to construct the theory of inductive inference by analogy with that of deductive
inference: the attempt to supply a premise for the inductive inference
that would make the latter deductive. It was known that the inductive conclusion
cannot be asserted as true; but it was hoped to give a demonstrative
proof, by the addition of such a premise, for the statement that the conclusion 
is probable to a certain degree. Such a proof is dispensable because we can 
assert a statement in the sense of a posit even if we do not know a probability, 
or weight, for it. If the inductive conclusion is recognized as being asserted, 
not as a statement maintained as true or probable, but as an anticipative 
posit, it can be shown that a uniformity postulate is not necessary for the 
derivation of the inductive conclusion. 

We used the assumption of the existence of a limit of the frequency in 
order to prove that, if no probabilities are known, the anticipative posit is 
the best posit because it leads to success in a finite number of steps. With 
respect to the individual act of positing, however, the limit assumption does 
not supply any sort of information. The posit may be wrong, and we can only 
say that if it turns out to be wrong we are willing to correct it and to try 
again. But if the limit assumption is dispensable for every individual posit, 
it can be omitted for the method of positing as a whole. The omission is required
because we have no proof for the assumption. But the absence of proof
does not mean that we know that there is no limit; it means only that we do
not know whether there is a limit. In that case we have as much reason to try 
a posit as in the case that the existence of a limit is known; for, if a limit of 
the frequency exists, we shall find it by the inductive method if only the acts 
of positing are continued sufficiently. Inductive positing in the sense of a 
trial-and-error method is justified so long as it is not known that the attempt 
is hopeless, that there is no limit of the frequency. Should we have no success, 
the positing was useless; but why not take our chance? 

The phrase “take our chance” is not meant here to state that there is a 
certain probability of success; it means only that there is a possibility of success
in the sense that there is no proof that success is excluded. Furthermore,
the suggestion to try predictions by means of the inductive method is not an
advice of a trial at random, of trying one’s luck, so to speak; it is the proposal
of a systematic method of trial so devised that if success is attainable the 
method will find it. 

To make the consideration more precise, some auxiliary concepts may be 
introduced. The distinction between necessary and sufficient conditions is 
well known in logic. A statement c is a necessary condition of a statement a
if $a \supset c$ holds, that is, if a cannot be true without c being true. The statement c
will be a sufficient condition of a if $c \supset a$ holds. For instance, if a physician
says that an operation is a necessary condition to save the patient, he does 
not say that the operation will save the man; he only says that without the 
operation the patient will die. The operation would be a sufficient condition 
to save the man if it is certain that it will lead to success; but a statement of 
this kind would leave open whether there are other means that would also
save him. 

These concepts can be applied in the discussion of the anticipative posit. 
If there is a limit of the frequency, the use of the rule of induction will be a 
sufficient condition to find the limit to a desired degree of approximation. 
There may be other methods, but this one, at least, is sufficient. Consequently, 
when we do not know whether there is a limit, we can say, if there is any way 
to find a limit, the rule of induction will be such a way. It is, therefore, a 
necessary condition for the existence of a limit, and thus for the existence 
of a method to find it, that the aim be attainable by means of the rule of 
induction. 

To clarify these logical relations, we shall formulate them in the logical 
symbolism. We abbreviate by a the statement, “There exists a limit of the 
frequency”; by b the statement, “I use the rule of induction in a repeated 
procedure”; by c the statement, “I shall find the limit of the frequency”. 
We then have the relation 3

$$a \supset (b \supset c) \qquad (1)$$

This means, $b \supset c$ is the necessary condition of a, or, in other words, the
attainability of the aim by the use of the rule of induction is a necessary 
condition of the existence of a limit. Furthermore, if a is true, b is a sufficient 
condition of c. This means, if there is a limit of the frequency, the use of the 
rule of induction is a sufficient instrument to find it. 
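
The two readings of relation (1) can be checked mechanically. The sketch below treats the implications as ordinary truth-functional (material) implications, which is a simplifying assumption: it does not capture the nomological character ascribed to them in the footnote. The helper names `implies` and `relation_holds` are hypothetical.

```python
from itertools import product


def implies(p, q):
    """Material implication: false only when p is true and q is false."""
    return (not p) or q


def relation_holds(a, b, c):
    """Relation (1): a implies (b implies c)."""
    return implies(a, implies(b, c))


# All truth-value assignments (a, b, c) compatible with relation (1).
admitted = [
    (a, b, c)
    for a, b, c in product([False, True], repeat=3)
    if relation_holds(a, b, c)
]

# Reading 1: "b implies c" is a necessary condition of a.
necessary = all(implies(b, c) for a, b, c in admitted if a)

# Reading 2: if a is true, b is a sufficient condition of c.
sufficient = all(c for a, b, c in admitted if a and b)

print(len(admitted), necessary, sufficient)
```

Only the assignment in which a limit exists, the rule is used, and the limit is nevertheless not found is excluded by the relation.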

It is in this relation that I find the justification of the rule of induction. 
Scientific method pursues the aim of predicting the future; in order to construct
a precise formulation for this aim we interpret it as meaning that
scientific method is intended to find limits of the frequency. Classical induction 
and predictions of individual events are included in the general formulation 

3 The implications occurring here must be regarded as nomological operations: the first as
a tautological implication, the second as a relative nomological implication. See ESL, § 63.
as the special case that the relative frequency is = 1. It has been shown
that if the aim of scientific method is attainable it will be reached by the 
inductive method. This result eliminates the last assumption we had to use 
for the justification of induction. The assumption that there is a limit of the 
frequency must be true if the inductive procedure is to be successful. But we 
need not know whether it is true when we merely ask whether the inductive 
procedure is justified. It is justified as an attempt at finding the limit. Since 
we do not know a sufficient condition to be used for finding a limit, we shall 
at least make use of a necessary condition. In positing according to the rule 
of induction, always correcting the posit when additional observation shows 
different results, we prepare everything so that if there is a limit of the frequency
we shall find it. If there is none, we shall certainly not find one; but
then all other methods will break down also. 

The answer to Hume’s question is thus found. Hume was right in asserting 
that the conclusion of the inductive inference cannot be proved to be true; 
and we may add that it cannot even be proved to be probable. But Hume 
was wrong in stating that the inductive procedure is unjustifiable. It can be 
justified as an instrument that realizes the necessary conditions of prediction, 
to which we resort because sufficient conditions of prediction are beyond our 
reach. The justification of induction can be summarized as follows: 

Thesis 6 . The rule of induction is justified as an instrument of positing 
because it is a method of which we know that if it is possible to make statements 
about the future we shall find them by means of this method. 

This thesis is not meant to say that the inductive rule represents the only 
method of the kind described. In (1, § 87) were formulated other forms of
posits that also must lead to the limit if there is one. Let us investigate 
whether the rule of induction constitutes the best method of finding the limit. 

In order to answer the question we must divide the possible methods into
two classes. In the first class we put all rules of the form (1, § 87), rules that
differ from the rule of induction only inasmuch as they include a function $c_n$
that is formulated explicitly so as to converge to 0 with increasing n. In the
second class we put all other methods that will lead to the limit of the frequency.
These methods will also converge asymptotically with the rule of
induction; but they differ from the form (1, § 87) because they do not state
the convergence explicitly.

As to the first class, it was explained in §§ 87-88 that we cannot prove 
the rule of induction to be superior to other methods included in the class. 
There may be, and in general will be, forms of the function $c_n$ that are more
advantageous than the function $c_n = 0$. If we knew one of these forms, we
should prefer it to the rule of induction. The method of correction (§§ 89-90) 
may be regarded as an instrument for finding such forms. When, on the 
contrary, we know nothing, we can choose what we like. The rule of induction 
has the advantage of being easier to handle, owing to its descriptive simplicity.
Since we are considering a choice among methods all of which will
lead to the aim, we may let considerations of a technical nature determine 
our choice. 
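
The situation of the first class can be sketched as follows. The sketch assumes that the rules in question take the additive form "observed frequency plus a correction that tends to 0"; this is our reading for illustration, since formula (1, § 87) is not reproduced here. The three correction functions, the sequence, and all names are hypothetical.

```python
import math


def frequency(sequence, n):
    """Relative frequency of 1s among the first n elements."""
    return sum(sequence[:n]) / n


def posit(sequence, n, correction):
    """Posit of the assumed form f^n + c_n, with c_n given by `correction`."""
    return frequency(sequence, n) + correction(n)


# Hypothetical members of the first class; each c_n converges to 0.
CORRECTIONS = {
    "c_n = 0 (rule of induction)": lambda n: 0.0,
    "c_n = 1/n": lambda n: 1.0 / n,
    "c_n = -1/sqrt(n)": lambda n: -1.0 / math.sqrt(n),
}

# Hypothetical sequence with limit of the frequency 1/3.
sequence = [1 if k % 3 == 0 else 0 for k in range(1, 10001)]

for name, c in CORRECTIONS.items():
    early = posit(sequence, 10, c)
    late = posit(sequence, 10000, c)
    print(f"{name}: posit at n=10 is {early:.4f}, at n=10000 is {late:.4f}")
```

The rules disagree for small n and agree in the long run; which of them is more advantageous for a given sequence cannot be read off from the form of the rule alone.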

In regard to the second class the situation is different. If a method is presented
with the assertion that it is a method of this class, the difficulty arises
of how to prove the assertion. Of course, there may be such a method. Every 
oracle or soothsayer maintains that he has found one. Such a method is usually 
presented in the form of a prediction of individual events. This is included 
in our theory as the case in which the probability, or the frequency limit, is 
= 1. We may, therefore, generalize the problem so as to concern the prediction 
of any value of the limit of the frequency. Assume that a clairvoyant asserts 
that he is able to predict only the probability of an event—to predict the 
limit of the frequency in a sequence. We shall not be willing to believe him 
until we have checked his abilities. He might well be following a method that 
will never lead to the limit of the frequency. Such methods are certainly possible.
For instance, if we were to posit that the limit is outside the interval
$f^n \pm \delta$, we should certainly never reach the limit by continued application
of this rule. The inadequacy of the methods of oracles and soothsayers is 
not so clearly apparent. But how can such methods be tested? 

Obviously, there is only one way—to test these methods by means of the 
rule of induction. We would ask the soothsayer to predict as much as he 
could, and see whether his predictions finally converged sufficiently with the 
frequency observed in the continuation of the sequence. Then we would count 
his success rate. If the latter were sufficiently high, we would infer by the 
rule of induction that it would remain so, and thus conclude that the man 
was an able prophet. If the success rate were low, we would refuse to consult 
him further. It is true that in the latter case the soothsayer may refer us to 
the future, declaring that on continuation of the sequence his prediction of a 
limit may still come true. Although clairvoyants favor such an attitude, 
finally even the most ardent believer no longer places any faith in them. In 
the end the believer submits his judgment to the rule of induction. He must 
do so because the rule of induction is a method of which he knows that it will 
lead to the aim if the aim is attainable, whereas he does not know anything 
about the oracle and the clairvoyant. 
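
The test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a procedure given in the text: the event sequence is simulated with an assumed probability of 0.3, and the names `running_frequencies` and `success_rate` are hypothetical. The soothsayer's predicted limit is compared with the observed frequency f^n at every later checkpoint, and the share of checkpoints at which it lies within ± δ is counted as his success rate.

```python
import random


def running_frequencies(seq):
    """Yield f^n, the relative frequency of 1s among the first n elements."""
    count = 0
    for n, x in enumerate(seq, start=1):
        count += x
        yield count / n


def success_rate(predicted_limit, seq, delta, start):
    """Share of checkpoints n >= start where |f^n - predicted_limit| <= delta."""
    tail = list(running_frequencies(seq))[start - 1:]
    hits = sum(1 for f in tail if abs(f - predicted_limit) <= delta)
    return hits / len(tail)


# Assumption: a simulated sequence whose true frequency limit is 0.3.
rng = random.Random(0)
seq = [1 if rng.random() < 0.3 else 0 for _ in range(20000)]

# One soothsayer predicts a limit of 0.3, another predicts 0.8.
able_prophet = success_rate(0.3, seq, delta=0.05, start=2000)
poor_prophet = success_rate(0.8, seq, delta=0.05, start=2000)
print(able_prophet, poor_prophet)
```

The judgment that the first prophet is able is itself an inductive inference from his observed success rate, which is the point of the passage: the test of any rival method runs through the rule of induction.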

We see, by the way, that with this subordination to induction the oracle 
in all its forms loses its mystical glamor. Like other methods of prediction, 
it is subject to a scientific test. It was explained above that science itself is 
at work to find methods of better convergence by the construction of a network
of inductions in the form of the method of correction. There is no need,
therefore, to ask the help of oracles or clairvoyants in order to improve our 
methods of approximation. 

We thus come to the result that the rule of induction can by no means be
maintained to be the best method of approximation. But with its help it is 
possible to find better methods of approximation. Scientific method makes 
use of this fact to a great extent. The concatenation of empirical results in 
a scientific system is the way to improve the method of approximation. The 
rule of induction, or one of its equivalents, is the only method that can be 
used in the test of other methods of approximation, because it is the only 
method of which we know that it represents a method of approximation. 

In discussing the method of correction we have presented methods that 
offer ways to a better approximation. However, a proof that the approximation
will, in fact, be better cannot be given; only the possibility of a better
approximation can be proved. We have no means of excluding the equality 
sign in the inequalities (1 and 2, § 90) so long as we abstain from the use of 
inductive inferences. When we regard the use of posits of higher levels as a 
better method of convergence, the result must be strictly formulated as 
follows: in employing posits of higher levels we carry through the necessary 
conditions for obtaining a better convergence. The justification of the method 
of correction is therefore given in the same way as that of the inductive 
inference in general. 

A remark about the limit condition must be added. It was stated earlier 
that success by the inductive method is possible only if the sequences under 
consideration have a limit of the frequency. This statement requires qualification.
For the inductive method to have success, it is not necessary that
all the sequences considered have a limit. It is possible that some have no
limit of the frequency and that we shall discover this fact by using other 
sequences that do have limits. Assume, for instance, that in continuing a 
sequence we find its frequency to oscillate between the values \ and §. We
then regard a frequency value close to \ as an event B₁ and a frequency
value close to J as an event B₂. When we now consider the sequence of events
B₁ and B₂, we may find that it has a limit of the frequency for each of these
events. Using the rule of induction in the latter sequence, we thus find that 
the former sequence has no limit of the frequency. 
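
A small Python sketch may make this construction concrete. It rests on assumptions not in the text: the sequence is built from blocks of doubling length filled alternately with 1s and 0s, which makes the frequency at the block ends swing between 1/3 and values approaching 2/3, and the names `oscillating_sequence` and `frequency_at_block_ends` are hypothetical. The second-level sequence records, at each block end, whether the frequency is near the lower value (event B1) or the upper value (event B2).

```python
def oscillating_sequence(num_blocks):
    """Blocks of length 1, 2, 4, ... filled alternately with 1s and 0s."""
    seq = []
    for k in range(num_blocks):
        seq.extend([1 if k % 2 == 0 else 0] * (2 ** k))
    return seq


def frequency_at_block_ends(num_blocks):
    """Relative frequency of 1s measured at the end of each block."""
    seq = oscillating_sequence(num_blocks)
    freqs, ones, pos = [], 0, 0
    for k in range(num_blocks):
        ones += sum(seq[pos:pos + 2 ** k])
        pos += 2 ** k
        freqs.append(ones / pos)
    return freqs


freqs = frequency_at_block_ends(16)

# Second-level sequence: B1 = frequency near the lower value, B2 = near the upper.
events = ["B1" if f < 0.5 else "B2" for f in freqs]
b1_frequency = events.count("B1") / len(events)

print([round(f, 3) for f in freqs[-4:]])
print(b1_frequency)
```

In this sketch the first-level frequency never settles, while the frequency of B1 in the second-level sequence is stable at one half. Applying the rule of induction to the second sequence is what licenses the posit that the first has no limit.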

Since it is mathematically possible to construct sequences that have no 
limit of the frequency, it seems reasonable to assume that there are sequences 
of natural events having this property. It is clear, in any case, that we have 
no right to assume that all sequences of natural events have a limit of the 
frequency. The question has been asked whether we cannot prove, at least, 
that there must exist some sequences of natural events having a limit of the 
frequency. A theorem of this kind is presumably demonstrable; that is, it 
seems plausible that, given any system of sequences without a limit, we shall 
always be able to construct from them another sequence that has a limit. 
For the inductive problem, however, the question is irrelevant. In making
inductive posits it does not help to know that there are sequences with a limit
of the frequency; what we must know is whether the sequence under consideration
has one. Since the question cannot be answered a priori, the justification
of induction must be given independently of a limit assumption, as it is achieved
in thesis θ.

Returning to a consideration of this thesis, we shall analyze the kind of 
justification it affords for induction. To clarify the analysis, we shall refer to 
the distinction between formulations I and II of the inductive rule (see p. 450).
Formulation II postulates the existence of a probability relation between an
observed frequency and future events, whereas formulation I does not. This
formulation, we said, expresses the conception that between an observed 
frequency and the statement of probability merely the ground-for-assertion 
relation holds. 

We are now able to give the final explanation why this relation can be 
maintained and thus why the rule of induction can be justified. The title to 
employ the inductive rule is based on a logical relationship between the 
aim of knowledge and a method the applicability of which constitutes a 
necessary condition of success. This relation, which can be formulated only 
in the metalanguage and is expressed in (1), may be abbreviated as the 
condition relation. The justification, therefore, is made possible only because 
we use formulation I, which differs from formulation II in that it does not
identify the ground-for-assertion relation with a probability relation. For 
anticipative posits, the ground-for-assertion relation is derived from the 
condition relation. 

Attempts at identifying the ground-for-assertion relation with a probability 
relation, and thus at basing the justification of induction on a probability 
relation, are doomed to failure because they lead to a conception in which 
the rule of induction asserts an object-language relation between an observed 
frequency and future events as soon as the object interpretation of probability 
is used (§ 87). The rule would then constitute a synthetic statement of the 
object language, depending for its support on arguments that are derived 
from a rational belief, from an a priori insight into the structure of the 
physical world—claims that cannot be taken seriously by anyone who is 
accustomed to apply the gauge of scientific truth to logical analysis. It was 
explained above (p. 372) that there is no such thing as synthetic self-evidence; 
self-evidence can be admitted as a criterion of truth only for analytic statements.
That adherence to this fundamental principle of empiricism does
not exclude a justification of induction is shown through thesis θ; by means
of the condition relation we can construct a justification of induction that is 
free from all forms of synthetic self-evidence. 

The condition relation (1), which is formulated in the implication that 
constitutes the second part of thesis θ, is a tautology; this relation follows
from the definition of a limit. Therefore, an empirical assumption is not used
for the justification; this avoids the fallacy analyzed in Hume’s second result 
(p. 470). But Hume’s first result is not contradicted: it is not maintained 
that the inductive conclusion is tautologically implied by its premise. A synthetic
inference is justified by means of a tautology. Such a procedure involves
no contradiction: whereas the relation between the premise and the conclusion 
of the inductive inference is synthetic, the relation between the inductive 
procedure and the aim of knowledge, that is, the condition relation, is analytic. 
The recognition that a tautological justification of a synthetic inference can 
be given makes the solution of the problem of induction possible. 
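
The analytic character of the condition relation can be made explicit in a short derivation. What follows is a sketch in notation assumed here, not a reproduction of the author's formula (1): f^n stands for the relative frequency observed in the first n elements, p for the limit, and δ for the chosen interval of exactness.

```latex
% Assumed notation: f^n = relative frequency after n elements,
% p = limit of the frequency, \delta > 0 = interval of exactness.
\[
  \lim_{n\to\infty} f^n = p
  \;\iff\;
  \forall \delta > 0 \;\, \exists n_0 \;\, \forall n \ge n_0 :\;
  |f^n - p| < \delta .
\]
% The posit made at step n asserts that the limit lies within f^n \pm \delta.
% For a fixed \delta and every n \ge n_0 this posit is true, hence:
\[
  \bigl(\exists p : \lim_{n\to\infty} f^n = p\bigr)
  \;\Longrightarrow\;
  \exists n_0 \;\, \forall n \ge n_0 :\;
  p \in (f^n - \delta,\; f^n + \delta) .
\]
```

The implication uses nothing beyond the definition of a limit. It says nothing about whether the antecedent holds, nor about how large n_0 is, which is why it can justify the procedure without asserting that any particular posit is true or probable.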

This solution presupposes a moderation of the requirements to be satisfied 
by a justification: it involves the renunciation of a proof that the inductive 
conclusion is true or probable. To be sure, it would be a superior justification 
if we could prove that predictions must come true, or that there is some 
probability of their coming true. But if such a proof is impossible we shall be 
glad to have, at least, a method that we know will lead to success if success 
is possible. This logical relation may be illustrated by the example used in 
the Preface. When Magellan planned to find a passage through or around the 
Americas to the Pacific, he did not know whether there was one, and he did 
not even have a probability for the assumption of its existence. But he knew 
that if there was one he would find it by sailing along the coast—so his enterprise
was justified.

Such illustrations suffer from the fact that they belong, not in primitive, 
but in advanced, knowledge. Thus the implication, “If there is a passage he 
will find it by sailing along the coast”, is synthetic and therefore established 
by means of inductive inferences. In other examples the attainability of the 
aim can be judged in terms of probabilities; if the probability of reaching the 
aim turns out to be very low, the realization of the necessary conditions may 
seem scarcely advisable. Such a situation corresponds practically to a case 
where it is known that the aim is not attainable, at least not until further 
conditions are satisfied. Thus if a man wants to be a millionaire he must have 
a bank account; but taking out a bank account is usually not associated with 
the hope of ever being able to write out six-figure checks.⁴ In all such instances,
inductive inferences are applied and the legitimacy of induction is taken for 
granted. Only in the ultimate justification of induction itself must we renounce 
the use of inductive methods; the justification must be given within primitive 
knowledge, and therefore we have no other means at hand than considerations 
concerning necessary conditions, not supported by an estimate of the attainability
of the aim.

⁴ Similar considerations apply to a problem known as Pascal’s wager, which has sometimes
been wrongly compared to my justification of induction; see my answer, “On the Justification
of Induction,” in Jour. of Philos., Vol. 37 (1940), pp. 101-102.

Some critics have called my justification of induction a weak justification.
Such judgments originate from a rationalistic conception of scientific method. 
In spite of the empiricist trend of modern science, the quest for certainty, 
a product of the theological orientation of philosophy, still survives in the 
assertion that some general truths about the future must be known if scientific 
predictions are to be acceptable. It is hard to see what would be gained by 
the knowledge of such general truths. As was pointed out earlier, if we knew 
for certain that sequences of natural events have limits of the frequency, 
our situation in the face of any individual prediction would not be better 
than it is without such knowledge, since we would never know whether the 
observed initial section of the sequence were long enough to supply a satisfactory
approximation of the frequency. It is no better with other forms of
the uniformity postulate. How does it help to know that similar event patterns 
repeat themselves, if we do not know whether the pattern under consideration 
is one of them? In view of our ignorance concerning the individual event 
expected, all general truths must appear as illusory supports. 

The aim of knowing the future is unattainable; there is no demonstrative 
truth informing us about future happenings. Let us therefore renounce the 
aim and renounce, too, the critique that measures the attainable in terms 
of that aim. It is not a weak argument that has been constructed. We can 
devise a method that will lead to correct predictions if correct predictions 
can be made—that is ground enough for the application of the method, even 
if we never know, before the occurrence of the event, whether the prediction 
is true. 

If predictive methods cannot supply a knowledge of the future, they are, 
nevertheless, sufficient to justify action. In order to analyze the applicability 
of the inductive method as a basis for action, we must inquire into the presuppositions
on which an action depends.

Every action depends on two presuppositions. The first is of a volitional 
nature: we wish to attain a certain aim. This aim can, at best, be reduced 
to more general volitional aims, but it cannot be given other than volitional 
grounds. A man who likes to exercise may justify his volitional aim by stating 
that he wants to retain a healthy body—but thereby the special volitional 
aim is only reduced to a more general one. The second presupposition is of a 
cognitive nature: we must know what will happen under certain conditions 
in order to be able to judge whether they are adequate for the attainment 
of the aim. If, for instance, I set up the general volitional aim of a healthy 
body, I can derive from this aim the usefulness of athletics only when I know 
that exercise makes the body healthy. Thus for every individual action I must 
know a statement about the future if the action is intended to contribute to 
the achievement of a general volitional aim. Only the combination of the 
two presuppositions, the volitional aim and knowledge about the future, 
makes purposive action possible. When the physician induces the patient to
take an anodyne, he must know, first, whether the patient wants to get rid 
of his pain, and second, whether the drug will relieve it. When a politician 
advocates a new law, he wants to reach some goal and assumes that the law
will attain it. The two presuppositions for action are of this kind. 

The first presupposition, the volitional decision, need not be discussed 
here. Within the boundaries of a logical analysis we investigate the second 
presupposition for action, that is, the cognitive presupposition. Now it is 
clear that, though the inductive rule does not supply knowledge of a future 
event, it supplies a sufficient reason for action: we are justified in dealing with 
the anticipative posit as true, not because we can expect success in the individual
case, but because if we can ever act successfully we can do so by
following the directive of induction. 

The justification of induction constructed may, therefore, be called a pragmatic
justification: it demonstrates the usefulness of the inductive procedure
for the purpose of acting. It shows that our actions need not depend on a proof 
that the sequences under consideration have the limit property. Actions can 
be made in the sense of trials, and it is sufficient to have a method that will 
lead to successful trials if success is attainable at all. It is true that this 
method has no guaranty of success. But who would dare to ask for such a 
guaranty in the face of the uncertainty of all human planning? The physician 
who operates on a patient because he knows that the operation will be the 
only chance to save the patient will be regarded as justified, though he cannot 
guarantee success. If we cannot base our actions on demonstrative truth, 
we shall welcome it that we can at least take our chance. 

That is a rational argument. But who refers to it when he applies the 
inductive method in everyday life? If asked why he accepts the inductive 
rule, he answers that he believes in it, that he is firmly convinced of its 
validity and simply cannot give up inductive belief. Is there a justification 
for this belief? 

The answer is a definite “No”. The belief cannot be justified. As long as 
such a “No” was averred by a philosophy of skepticism, it constituted a 
negative judgment on all human planning and acting, which it seemed to 
prove utterly useless. It is different for the philosophy of logical analysis, 
which distinguishes between justification of the belief and justification of the 
action. Actions directed by the rule of induction are legitimate attempts at 
success; no form of belief is required for the proof. He who wants to act need 
not believe in success; it is a sufficient reason for action to know how to 
prepare for success, how to be ready for the case that success is attainable. 
Belief in success is a personal addition; whoever has it need not give it up. 
For his actions it is logically irrelevant: whether or not he believes in success, 
the same actions will follow.

I say “logically irrelevant”, for I know very well that, psychologically
speaking, belief may not be irrelevant. Many a person is not able to act 
according to his posits unless he believes in their success, since few have the 
inner strength to take a possible failure into account and yet pursue their aim. 
Nature seems to have endowed us with the inductive belief as a measure of 
protection, as it were, facilitating our actions, though without it we would 
be equally justified, or obliged, to act. It is difficult, indeed, to free oneself 
from such a belief; and Hume was right when he called the belief in induction 
an unjustified but ineradicable habit. But since Hume could not show that 
even without this belief action is justified when it follows the rule of induction,
there remained for him only skeptical resignation. 

The logician need not share this negative attitude. He can show that we 
must act according to the rule of induction even if we cannot believe in it. 
This result may be the reason why it is easier for him to renounce the belief; 
with the loss of the belief he does not at the same time lose his orientation in 
the sphere of action. We do not know whether tomorrow the order of the 
world will not come to an end; tomorrow all known physical laws may be 
invalidated, the sun may no longer shine, and food no longer nourish us—or 
at least our own world may come to an end, because we may close our eyes 
forever. Tomorrow is unknown to us, but this fact need not make any difference
in considerations determining our actions. We adjust our actions to the
case of a predictable world—if the world is not predictable, very well, then 
we have acted in vain. 

A blind man who has lost his way in the mountains feels a trail with his 
stick. He does not know where the path will lead him, or whether it may 
take him so close to the edge of a precipice that he will be plunged into the 
abyss. Yet he follows the path, groping his way step by step; for if there is 
any possibility of getting out of the wilderness, it is by feeling his way along 
the path. As blind men we face the future; but we feel a path. And we know: 
if we can find a way through the future it is by feeling our way along this path. 

collectives, 148
combinable, 105 

combination: class, 34; independence by, 
172; nonindividualized, 156, 157 
comma operation, 399, 402, 410 
common class, 34, 203 
commutativity, 21, 29, 197, 209 
compact sequence, 69, 106, 134, 135 
compensation, probability, 160, 290, 291 
compensation of the dispersion, law of, 195, 
199, 200 

competition of chances, 251 ff. 
complement, rule of the, 60, 119, 120 
complementary class, 34 
complete: disjunctions, 80, 81; independence, 
106; system of probabilities, 106 ff., 115-116 
composition, rule of, 102, 330 
compound: class, 50; mutual probabilities, 
112, 114; statement, 15 
concatenation, 431, 477 
condition: necessary or sufficient, 474; relation,
474, 478, 479
Condorcet, M. J., 6 


confirmation: inference by, 95, 96, 362, 431, 
442, 456; paradox of, 435 
conjunct, 34, 35 
conjunction, 15 

connective operations, 18, 19, 28 
consistency: of axiomatic system, 41; of inductive
logic, 450

constants, 30, 314; of disintegration, 248 
continuity, axiom of, 207, 209 
continuous: attribute, 205 ff.; probability 
sequences, 237 ff. 
contradiction, 20, 21 
contraposition, 22, 436 
convergence: asymptotic, 447, 452, 453, 455, 
459; of frequency sequences, 262, 278, 
289; measure of, 289; of posits, 466, 467; 
uniform, 347 

convergence theorems; see Bernoulli’s and 
Laplace’s theorems 

convergent: lattice, 172; probabilities, 76; 

quasi-, 468, 469 
coordinative definition, 40, 69 
Copeland, Arthur H., 121, 122, 147, 148, 343 
correction, method of, 461 ff., 475 
counting; see enumeration 
couple: class, 50; conjunct, 34, 35; disjunct, 
34, 35 

coupling, degree of, 205, 403 
critical point of Bernoulli curves, 271 
cross induction, 430, 431, 449, 462 
crucial experiment, 443 
cyclical negation, 420, 421 

Dalkey, Norman, 112, 447 
deduced probabilities, 359, 458 
deductive: hypothetico-, 431, 432; inference,
23, 24, 425, 451; logic, 12, 38, 350, 371,
377, 448, 450, 471; probabilities, 457 ff. 
defective operation, 402 
definiendum, 19 
definiens, 19 

definitions, 19; coordinative, 40; explicit, 40; 

implicit, 40; relative, 170 n.; in use, 33 
demarcation value, 390 
density: Bernoulli, 271, 304; probability, 
211, 229 

derivation, rules of, 23 ff., 448, 471 
determinism, 7, 249, 387 n. 
deviation: mean square of, 190; standard, 
190, 221-223, 227; of stellar light, 235, 236 
diagnosis, differential, 443
dichotomy, 390, 411 
dilatation surface, 225 
diminishing utility, law of, 186 
directive, 24 

discovery, context of, 433, 434 
disintegration, constant of, 248 
disjunct, 34, 35 

disjunctions, 15; complete, 80, 81; exclusive,
57 ff., 80, 81, 84, 87, 97, 109, 111, 117;
inclusive, 15, 82 ff., 100, 110, 114; incomplete,
97, 110 ff.; of infinitely many terms,
183 ff.; many-term, 80, 81 
disjunctive: reference class, 96, 100, 107, 
111, 114; weight, rule of, 111 
dispersion, 190, 195, 227, 283 ff., 289, 292; 
compensation of the, 195, 199, 200; linear, 
190-199; √n-law of the, 199; nonnormal,
290; normal, 287-291; quadratic, 190, 194,
197; spreading of, 194; statistical definition
of, 190, 219; subnormal, 288, 289,
291, 299; supernormal, 288, 289, 291; 
theoretical definition of, 219 
distribution, 208; binomial, 264; multinomial,
264, 362; normal, 214, 221, 228;
twofold, 21 

distributive reference, rule of, 114 
divergent probabilities, 76, 81 
division: method of, 411; regular, 144
Dörge, Karl, 121, 122, 148
double posit, method of, 373 
drag, probability, 159, 291 
Dubislav, Walter, 387, 407 

Einstein: law of gravitation, 236, 440; theory 
of relativity, 439 
Einstein-Bose statistics, 158, 360
elementary statement, 15 
elements of classes, 33 
elimination, rule of, 76-81, 98, 107; continuous
form of, 233; extended, 81, 82
empiricism, 372, 471, 478 
empty: class, 37, 56, 59, 119; statements, 20 
enumeration: induction by, 329, 351, 429, 
430, 433, 442, 444 ff., 470; lattice, 172; 
with overlapping, 142; by sections, 145 
Epicureans, 387 
equality, logical, 15 
equipossibility, 353 
equiprobability, 353, 355, 357, 358 


equivalence, 15, 16 

errors: observational, 228, 436, 465; theory 
of, 6, 358 

ESL, abbreviation, 12
evidence: indirect, 94, 95, 126, 251, 361-364,
431-434, 438, 455; weight of, 441 
exclusive: disjunctions, 57 ff., 80, 81, 84, 87, 
97, 109, 111, 117; of infinitely many terms, 
183 ff.; of r terms, 80 

existence: determination of, 61, 139; rule of, 
52, 53, 58, 59, 61 n., 71, 72, 75, 119, 139 
141; unspecified, 342 

existential: operator, 27, 51, 59; statements, 
341, 342 

expectation: degree of, 367; mathematical, 
181, 186 

expectational illusion, 368 
explanatory induction, 431, 432 
explicit definition, 40 

exponential: decrease and increase, 243; 

function, 213, 221, 243 
extension, 33 

extensional, 19 n., 403; modalities, 409; sequences,
340, 341, 349, 383

fair sample, 430, 443, 444 

fallacy of incomplete schematization, 96, 354 

fear, 367 

Fermat, P., 5 

finite: attainability, 447; classes, 340, 341; 
means, 322; sequences, 347, 397; systems 
of multivalued languages, 393 
finitization, 348, 382, 383 
first level, probabilities of, 311 ff., 383, 438, 
461 ff. 

fish-pond problem, 250 ff. 

Fisher, R. A., 360, 362, 452, 453, 454, 455, 
459, 461 

formal conception, 25, 39 
forward probabilities, 93 
free variables, 27, 30, 31 
frequency: alternation, 245; convergence of, 
262, 278, 289; curve, 445; dispersion, 283, 
289; interpretation, 72, 178, 345, 366, 371, 
372 ff., 379, 381, 395, 398, 471; lattice, 
277 ff., 284, 462; limit, 68 ff., 72 ff., 135 ff., 
147, 178, 338 ff., 382, 383, 445, 474, 475; 
sequence, 261, 262, 289; statements of 
Bernoulli's theorem, 274 ff. 

Freundlich, E., 235, 236 
functional, 26

functions: Bernoulli, 271, 305, 306, 463; calculus
of, 26 ff., 31; exponential, 213, 221,
243; many-place, 28; probability, 208, 228, 
229; propositional, 26, 33; situational, 26 
functor, 51, 59 

fundamental probabilities, 79, 80, 84, 85, 91, 
103, 399, 401 


Galileo’s theory of falling bodies, 433 
gambling, 5, 6, 148-150, 153, 182, 183, 186- 
188, 324, 355 ff., 373, 380 
games of chance, 5, 6, 150, 153, 182, 355 ff.,
373 

Gauss, C. F. W., 6, 213, 272 
generalization, 27 

geometrical probability, 203-205, 207, 355 
Goodman, N., 448 

gravitation: Einstein’s law of, 236, 440; 

Newton’s law of, 432, 433, 438 ff., 465 
Grelling, Kurt, 67 

ground-for-assertion relation, 450, 451, 455, 
460, 478 

Gustin, William, 120, 121 

height of American recruits, 212, 213, 215, 
222 

Heisenberg, W., 9, 249, 250 
Helmer, O., vi, xi, 456, 457, 459 
Hempel, C., vi, xi, 435, 456, 457, 459 
Hertzsprung-Russell diagram of fixed stars, 
216, 218, 229-231 

hierarchy: of posits, 465 ff., 477; of probabilities,
311; see also second level; third
level 

histogram, 210, 267 

homogeneous: lattice, 172, 314, 325; reference
class, 440, 443
hope, 368 

Hume, David, vii, viii, 7, 8, 25, 433, 470, 
471, 472, 475, 479, 482 
hypothesis, 455; probability of, 434 ff. 
hypothetical probabilities, 359 
hypothetico-deductive method, 431, 432 

identity, rule of, 21 
ignorance, human, 7, 354, 480 
illusion, expectational, 368


implicans and implicate, 16, 28 
implication: alternative, 423, 424; analytic 
connective, 28; class, 36; dissolution of, 
22; general, 27, 28, 46, 50, 57, 377, 434 ff.; 
indeterminate probability, 51, 71, 72; individual,
27, 57; individual probability, 57,
410; inferential, 22; internal probability,
48, 50, 203, 312; logical, 15, 45, 54, 57, 78,
117, 118 n., 122, 437; merging of, 22; mutual
probability, 109; paradoxes of, 18;
physical, 28, 57, 377; probability, 45 ff., 
54, 57, 73, 312 ff., 377; probability, of 
high degree, 57, 436; synthetic connective, 
28, 57; tautological, 23; transitivity of, 22, 
78 

implicational notation, 51, 109, 312 
implicit definition, 40 
impossibility, 54, 71, 349, 387, 404 
inclusion of classes, 36, 37 
inclusive disjunctions, 15, 82 ff., 100, 110, 
114 

incomplete disjunctions, 97, 110 ff. 
independence, 64, 67, 80, 91, 102 ff., 115, 
225; by combination, 172; complete, 106; 
of individual facts, 465; nontransitivity of, 
105 

indeterminacy, principle of, 9, 250 
indifference, principle of, 353 ff., 358, 359, 
369-371, 444, 456, 459 
indirect evidence, 94, 95, 126, 254, 361-364, 
431-434, 438, 455 

induction: advanced, 432, 442 ff.; classical, 
352, 469, 470, 474; cross, 430, 431, 449, 
462; by enumeration, 329, 351, 429, 430, 
433, 442, 444 ff., 470; explanatory, 431, 
432; justification of, 25, 470-481; primitive,
444, 445; rule of, 351, 365, 444 ff.,
450, 461, 464, 470-481; statistical, 352, 
469, 471; various forms of, 429 ff. 
inductive: belief, 481, 482; inference, 8, 70, 
326 ff., 351, 352, 431, 444, 451, 470, 473, 
479; logic, 12, 434, 448, 450, 471; probabilities,
457, 458; simplicity, 447 n.
inequalities: for convergence of posits, 466, 
467; for fundamental probabilities, 79, 80, 
84, 85, 91, 103, 401; of Tchebychev, 192, 
293, 295, 299 

inference, rule of, 23, 24; in probability logic, 
425, 426; see also confirmation; deductive; 
inductive; statistical 
infinite: classes, 340; means, 322; sequences,
68, 70, 71, 339-344, 351, 382, 383, 396, 
468; systems of multivalued languages, 
393 

inhomogeneous lattice, 172, 314, 325 
insurance: murder, 254 ff., 432; problems, 6, 
182, 246 

intensional sequences, 340, 341, 343 
interpretation, 25, 39; admissible, 40, 76, 
203, 337, 365, 366; axiom of, 345; object, 
378, 395; see also frequency; logical 
interspersed sequence, 68, 134 
intuitive appraisal, 379, 380 
invariance, domain of, 144, 146-150, 165; see 
also lattice 

inverse probabilities, 93, 327, 361, 442, 455 
isomorphism, 36, 40, 52, 207, 208, 396 

Jeffreys, Harold, 122, 369 
joint class, 34, 203 
Jorgensen, J., 387 

justification: asymptotic, 447, 448; context 
of, 433, 434; of induction, 25, 470-481; 
pragmatic, 481; see also condition relation; 
thesis θ

Kant, Immanuel, 8, 371, 470 
Kappler, E., 238 
Kendall, M. G., 363 
Kepler, J., 433 

Keynes, John M., 63, 122, 369, 380 
kinetic theory of gases, 6, 7, 174, 199, 228, 
246-247, 360, 443 
Kleene, S. C., 56 

knowledge; see advanced; primitive 
Kolmogoroff, A. N., 121 
Kries, Johannes von, 353 

Langford, C. H., 387 

Laplace, Pierre Simon, 6, 63, 92, 93, 272, 
329, 353; convergence theorem of, 328, 
441, 442, 444, 459 
large numbers, 429, 442 
lattice, 169 ff., 195 ff., 376, 383; Bernoulli, 
306; convergent, 172; enumeration, 172; 
frequency, 277 ff., 284, 462; homogeneous, 
172, 314, 325; inhomogeneous, 172, 314, 
325; invariant, 172-173; of mixture, 173, 
174 

law of nature, 6, 28, 57, 236, 249, 349, 434 


least squares, method of, 236 
Leibniz, Gottfried Wilhelm, 122, 371 
Lewis, C. I., 387 

likelihood, principle of maximum, 363, 364, 
452, 454, 456, 459 

limit, 68, 69; ascertainment of, 70-72, 339 ff., 
445 ff., 472 ff.; and Bernoulli's theorem, 
345, 346, 383; partial, 338, 339, 344-346; 
practical, 288, 347, 348, 447-448; sequences
without, 448, 477; in subsequences,
133, 136, 147, 338
limit statements: formulation of, 69, 338 ff.; 
meaning of, 340, 344, 345, 347 ff., 351, 
382 ff., 444; verifiability of, 71, 340, 343, 
344, 347, 349, 351, 365, 382, 383, 468 
logarithms as example of probability sequence,
343

logic: deductive, 12, 38, 350, 371, 377, 448, 
450, 471; inductive, 12, 434, 448, 450, 471; 
metrical, 403, 404; of modalities, 403 ff.; 
multivalued, 379, 387, 388, 392, 420; probability,
10, 379, 381, 388, 397, 411, 420,
469; quantitative, 369, 388; three-valued, 
387, 388, 420; two-valued, 17, 379, 381, 
387, 390, 392, 394, 401-403, 411, 412, 468, 
469; of weight, 408-409 
logical: equality, 15; implication, 15, 45, 54,
57, 117, 118 n., 122, 437; interpretation of
probability, 378; meaning, 349, 382;
product, 15, 16, 67, 203; sum, 15, 16, 57,
58, 67, 85, 203; see also impossibility;
necessity; possibility

Lukasiewicz, J., 387, 405 

MacColl, Hugh, 387 
McKinsey, J. C. C., 56 
Magellan's voyage, ix, 479 
many-place functions, 28 
many-term disjunctions, 80, 81 
Markoff chains, 159 
material thinking, 25, 83, 84 
mathematical: expectation, 181, 186; formalization
of probability, 116 ff.; notation,
51, 60, 116, 117
mathematics, truth of, 39, 388 
Maxwell: distribution, 228; equations, 440 
Maxwell-Boltzmann statistics, 360 
mean value: statistical definition of, 177; 
theoretical definition of, 180 
meaning: fictitious, 377; finitist, 349; of
limit statements (q.v.); logical, 349, 382;
physical, 382; probability, 382, 383, 444; 
of probability, 68, 337, 347, 366, 370, 372, 
378; transfer of, 28, 57, 377, 381; verifi¬ 
ability theory of, 340, 344, 348, 382 
measure: of alternation, 246; of convergence, 
289; function, 206; of precision, 214, 221; 
transformation, 355, 359 
members of classes, 33 
metalanguage, 23, 378, 389, 393, 395, 461, 
465; transition to, 390, 412, 413 
metatheorem, 24 
migration space, 238, 240 
Mill, John Stuart, 429, 430, 431 
Mineur, H., 218

Mises, Richard von, 68, 70, 75, 85, 86, 87, 
88, 121, 122, 132, 147, 150, 168, 208, 228, 
272, 343, 347, 349, 365 
mixture, lattice of, 173, 174 
modalities, 387, 404; extensional, 409; logic 
of, 403 ff.; nomological, 409; physical, 409; 
truth tables of, 406 
mode, 181 

model of an axiomatic system, 41 
modus ponens, 24, 425, 426
monadic operation, 17 
mountains, probability, 218, 219, 236 
multinomial distribution, 264, 362 
multiplication, theorem of: general, 62 ff., 
74, 107, 119; reduction of, 65 ff.; special, 
63 ff., 102, 142, 145, 146, 149, 364, 453, 
456, 460 

multivalued: languages, 393; logic, 379, 387, 
388, 392, 420 

mutual probability, 107, 108, 109; compound,
112, 114; implication, 109; simple,
112

Nagel, E., xi, 436 

nature: law of, 6, 28, 57, 236, 249, 434; uniformity
of, 199, 473, 480
necessary condition, 474 
necessity, 20, 377, 387, 404, 409 
negation, 15; cyclical, 420, 421; double, 21;
quantitative, 420, 421, 423
negative reference, 78, 108, 113, 114, 116 
Newton’s formula, 264, 265; law of gravitation,
432, 433, 438 ff., 465


Neyman, J., 360, 363, 364, 452, 453 
nomological: modalities, 409; operations, 
474 n.; statements, 349 n., 409 
nonbound probabilities, 77 
nonexclusive; see inclusive 
nonindividualized combinations, 156, 157 
nonlinear measure transformation, 355, 359 
nonnegative reference, 79, 108, 115 
nonnormal: dispersion, 290; sequence, 139 
normal: clustering, 153; curve, 214; dispersion,
287-291; distribution, 214, 221, 228;
sequence, 131, 144, 147, 148, 151, 154 ff., 
168, 262, 288, 291, 327, 463; sequences in 
narrower sense, 169-171 
normalization: axioms of, 54, 60, 72, 119, 
120; condition of, 212, 214 
null class, 37, 38 

object: interpretation of probability, 378, 
395; language, x, 22, 391, 395 
observational errors, 228, 436, 465; theory of, 
6, 358 

one-scope formulas, 30 
operand, 30, 31-32 

operators: all-, 27, 28, 49, 52; existential, 27;
negation of, 32; order of, 32-33
Oppenheim, P., 456, 457, 459 
optimistic persons, 367 
order: axioms of theory of, 136, 137; of probability
sequences, 122, 131 ff., 136, 350,
376; relations for probability, 380 
overlapping, 142 

partial limit, 338, 339, 344-346 
Pascal, B., 5, 152; wager, 479 n. 

Pearson, E. S., 360 
Peirce, C. S., 63, 122, 430, 446, 468 
pessimistic persons, 367 
Petersburg problem, 186-188 
phase probabilities, 133 ff. 
physical: implication, 28, 57, 377; law, 6, 
28, 57, 236, 249, 349, 434; meaning, 382; 
modalities, 409; necessity, 377, 409; possibility,
349, 409
place selection, 147 
Poincaré, Henri, 343, 355
Poisson, S. D., 68; sequences, 296 ff., 307, 
329, 374 

Pólya, György, 282, 290
posits, 373, 445; anticipative, 446, 451 ff.,
455, 461, 465, 466, 472-474, 481; appraised,
378, 411, 442, 446, 461; blind,
446; convergence of, 466, 467; double,
373; hierarchy of, 465 ff., 477; primary,
461, 462, 466 ff.; secondary, 461, 462,
466 ff.; weight of, 378, 379, 461
possibility, 54, 71, 349, 387, 404, 409 
Post, E. L., 387, 405 
practical limit, 288, 347, 348, 447-448 
precision, measure of, 214, 221 
predicate, 26 

predictional value, 366, 369-371, 460
primary posits, 461, 462, 466 ff.
primitive: concept, 369-371; induction, 441, 
445; knowledge, x, 364, 383, 444, 445, 
449, 450, 462, 463; probability sequence,
208 

product: logical, 15, 16, 67, 203; rule of the, 
90 ff., 107
property, 26 

propositional: function, 26, 33; sequence, 
380, 396, 397; variables, 23 
propositions, calculus of, 15, 21-22, 26, 31 

quadratic dispersion, 190, 194, 197 
quantification, 27 n. 

quantitative: logic, 369, 388; negation, 420, 
421, 423 

quantum mechanics, 9, 250, 376, 440 
quartile, 181 

quasi-convergent system, 468, 469 
quotation marks, use of, 23, 389, 395, 396 

railway statistics, 199-200
randomness, 131, 132, 133, 147-151, 349 
rational: belief, 369, 371, 372, 433, 459, 478;
reconstruction, 380
rationalism, 371, 433, 480 
reduced probabilities, 99, 100, 104, 153, 253 
reduction, rule of: continuous form of, 234;
general, 100, 107, 114; special, 97, 99, 107
reference class, 47, 86, 341; choice of, 374, 
377, 378, 439, 440, 449; disjunctive, 96, 
100, 107, 111, 114; empty, 56, 59, 119; 
homogeneous, 440, 443; and observational 
material, 375, 450, 460; universal class as, 
106, 107 

regular-invariant, 144, 166, 167, 172 
relations, 28 
relative: definition, 170 n.; probability, 50,
63, 106; probability function, 228 ff.
relativity, theory of, 439 
replacement, rule of, 25, 417 
restriction, 112 

retrogressive: interpretation, 369; principle, 
368 

reversal axioms, 61 n. 
rule, definition of, 24 
runs, 153, 154 
Russell, Bertrand, 39, 407 
Russell, H. N. (diagram of fixed stars), 216, 
218, 229-231 

sample, fair, 430, 443, 444 
schematization, incomplete, 96, 354 
Schwartz inequality, 299 
scope of operator, 30 

second level, probabilities of, 311 ff., 318 ff.,
323, 361, 383, 438 ff., 461 ff.
secondary posits, 461, 462, 466 ff.
sections, counting by; see enumeration
selection, 143 ff.; adequate, 365; operation
of, 399, 402, 410; place, 147 
self-corrective method, 446 
self-evidence: analytic, 371; synthetic, 371, 
372, 433, 478 

semiconvergent sequence, 347
shift theorem, 193 
simple mutual probabilities, 112 
simplest curve, 236 

simplicity, descriptive and inductive, 447 n. 
single case, probability of, 338, 366 ff., 371,
375, 376, 458, 465; frequency interpretation
of, 372 ff.
situational function, 26 
skepticism, 470, 471, 472, 481, 482 
soothsayer’s predictions, 476 
space totality, 174 
specialization, 26 
spectral type of stars, 232, 235 
Spielraumtheorie, 353
standard deviation, 190, 221-223, 227 
star statistics, 216-219, 229-232, 235, 236 
statistical: inference, 360-361, 453, 455;
probabilities, 359
Sternberg, W., 346 
Stirling’s formula, 266 
Stoics, 387 
Stumpf, C., 369
subalternation, formulas of, 32
subnormal dispersion, 288, 289, 291, 299 
subsequences, limit in, 133, 136, 147, 338 
substitution, rule of, 23; addition to, 417 
succession, rule of, 332, 376, 444 
sufficient condition, 474 
sum, logical, 15, 16, 57, 58, 67, 85, 203 
supernormal dispersion, 288, 289, 291 
symmetry: inference of, 359; of premises, 22 
synthetic: a priori, 371, 372, 459, 470; connective
implication, 28, 57; inference, 448,
479; self-evidence, 371, 372, 433, 478; 
statements, 20 

target as illustration, 47, 205-206, 209-210, 
218, 225, 237, 390 
Tarski, A., 387, 405 

tautologies, 20 ff., 31, 39, 388; lists of, 21-22, 
31-33; in probability logic, 413 ff., 423 
Tchebychev’s inequality, 192, 293, 295, 299 
tertium non datur, 21, 387, 394
thermodynamics, second principle of, 6 
thesis 0, 475 

third level, probabilities of, 313 
three-valued logic, 387, 388, 420 
time totality, 174 
Tornier, E., 61, 121, 365 
transfer: coefficients of, 242; degree of, 164; 
of meaning, 28, 57, 377, 381; probability, 
159 ff., 161, 162, 291, 463 
transition: to metalanguage, 390, 412, 413;
to two-valued logic, 390, 411
transitivity, 22, 78, 105, 113 
translation, rule of, 49; addition to, 171 
triangle, rule of the, 112, 113 
trichotomy, 411 


truth-functional operations, 18 
truth tables: of logic of modalities, 406; of 
logic of weight, 408; of probability logic, 
400, 421, 424; of two-valued logic, 17 
truth value of a statement, 17, 178, 379 
twofold distribution, 21 
two-valued logic, 17, 379, 381, 387, 390, 392, 
394, 401-403, 411, 412, 468, 469; transition
to, 390, 411

uniformity of nature, 199, 473, 480 
universal class, 37, 38, 106, 107 
univocality, axiom of, 53-55, 59, 72, 117, 119
unspecified existence, 342 

variables: bound, 27; free, 27, 30, 31; functional,
26, 51; propositional, 23
variance, 190 
Venn, John, 122 

verifiability: individual, 389; of limit statements,
71, 340, 343, 344, 347, 349, 351,
365, 382, 383, 468; nonindividual, 395; of 
single-case probabilities, 371; theory of 
meaning, 340, 344, 348, 382; unilateral, 
342, 345 

Ville, Jean, 148, 149 
volitional aim, 480 

wager, 149, 373, 380 

Wald, A., 148, 149, 360, 454 

weight: disjunctive, 111; of evidence, 441;
logic of, 408, 409; of posits, 378, 379, 461
Wright, G. H. von, 468 

Zawirski, S., 387 
Zilsel, Edgar, 337 